Sammy Martin

Philosophy and Physics background; just finished my MSc in AI at Edinburgh University. Interested in metaethics, anthropics and technical AI Safety.

Sammy Martin's Comments

[AN #80]: Why AI risk might be solved without additional intervention from longtermists
The biggest disagreement between me and more pessimistic researchers is that I think gradual takeoff is much more likely than discontinuous takeoff (and in fact, the first, third and fourth paragraphs above are quite weak if there's a discontinuous takeoff).

It's been argued before that Continuous is not the same as Slow by any normal standard, so the strategy of 'dealing with things as they come up', while more viable under a continuous scenario, will probably not be sufficient.

It seems to me like you're assuming longtermists are very likely not required at all in a case where progress is continuous. I take continuous to just mean that we're in a world where there won't be sudden jumps in capability, or apparently useless systems suddenly crossing some threshold and becoming superintelligent, not where progress is slow or easy to reverse. We could still pick a completely wrong approach that makes alignment much more difficult and set ourselves on a likely path towards disaster, even if the following is true:

So far as I can tell, the best one-line summary of why we should expect a continuous rather than a discontinuous takeoff comes from the interview Paul Christiano gave on the 80k podcast: 'I think if you optimize AI systems for reasoning, it appears much, much earlier.'
Paul's point, as I understand it, is that absent specific reasons to think otherwise, the prima facie case is that any time we are trying hard to optimize for some criterion, we should expect the 'many small changes that add up to one big effect' situation.
He then goes on to argue that the specific arguments that AGI is a rare case where this isn't true (as nuclear weapons were) are either wrong or not strong enough to make discontinuous progress plausible.

In a world where continuous but moderately fast takeoff is likely, I can easily imagine doom scenarios that would require long-term strategy or conceptual research early on to avoid, even if none of them involve FOOM. Imagine that the accepted standard for aligned AI follows some particular research agenda, like Cooperative Inverse Reinforcement Learning (CIRL), but it turns out that CIRL starts to behave pathologically and tries to wirehead itself as it gets more and more capable, and that it's a fairly deep flaw that we can only patch and not avoid.

Let's say that over the course of a couple of years, failures of CIRL systems start to appear and compound very rapidly until they constitute an existential disaster. Maybe people realize what's going on, but by then it's too late: the right move would have been to try some other approach to AI alignment, but the research for that doesn't exist and can't be done anywhere near fast enough. This is much like Paul Christiano's 'What failure looks like'.

The Value Definition Problem

I appreciate the summary, though the way you state the VDP isn't quite the way I meant it.

what should our AI system try to do (see "Clarifying 'AI Alignment'"), to have the best chance of a positive outcome?

To me, this reads like 'we have a particular AI; what should we try to get it to do', whereas I meant it as 'what Value Definition should we be building our AI to pursue'. That's why I stated it as 'what should we aim to get our AI to want/target/decide/do', or, to be consistent with your way of writing it, 'what should we try to get our AI system to do to have the best chance of a positive outcome', not 'what should our AI system try to do to have the best chance of a positive outcome'. Aside from that minor terminological difference, that's a good summary of what I was trying to say.

I fall more on the side of preferring indirect approaches, though by that I mean that we should delegate to future humans, as opposed to defining some particular value-finding mechanism into an AI system that eventually produces a definition of values.

I think your opinion is probably the majority opinion. My major point with the 'scale of directness' was to emphasize that our 'particular value-finding mechanisms' can have more or fewer degrees of freedom: from a certain perspective, 'delegate everything to a simulation of future humans' is also a 'particular mechanism', just with a lot more degrees of freedom. So even if you strongly favour indirect approaches, you will still have to make some decisions about the nature of the delegation.

The original reason I wrote this post was to get people to explicitly notice that we will probably have to do some of the philosophical labour ourselves, and then I discovered Stuart Armstrong had already made a similar argument. I'm currently working on another post (also based on the same work at the EA Hotel) with some more specific arguments for why we should construct a particular value-finding mechanism that doesn't fix us to any particular normative ethical theory, but does fix us to an understanding of what values are: something I call a Coherent Extrapolated Framework (CEF). But again, Stuart Armstrong anticipated a lot (but not all!) of what I was going to say.

The Value Definition Problem

Thanks for pointing that out to me; I had not come across your work before! I've had a look through your post and I agree that we're saying similar things. I would say that my 'Value Definition Problem' is an (intentionally) vaguer and broader question about what our research program should be; as I argued in the article, it is mostly an axiological question. Your final statement of the Alignment Problem (informally) is:

A must learn the values of H and H must know enough about A to believe A shares H’s values

while my Value Definition Problem is

“Given that we are trying to solve the Intent Alignment problem for our AI, what should we aim to get our AI to want/target/decide/do, to have the best chance of a positive outcome?”

I would say the VDP is about what our 'guiding principle' or 'target' should be in order to have the best chance of solving the alignment problem. I used Christiano's 'intent alignment' formulation, but yours actually fits better with the VDP, I think.