Daniel Kokotajlo's Comments

Three Kinds of Competitiveness

Oh right, how could I forget! This makes me very happy. :D

Three Kinds of Competitiveness

Good point about inner alignment problems being a blocker to date-competitiveness for IDA... but aren't they also a blocker to date-competitiveness for pretty much every other alignment scheme too? What alignment schemes don't suffer from this problem?

I'm thinking "Do anything useful that a human with a lot of time can do" is going to be substantially less capable than full-blown superintelligent AGI. However, that's OK, because we can use IDA as a stepping-stone toward it: IDA gets us an aligned system substantially more capable than a human, and we can then use that system to solve the alignment problem and build something even better.

It's interesting how Paul advocates merging cost and performance-competitiveness, and you advocate merging performance and date-competitiveness. I think it's fine to just talk about "competitiveness" full stop, and only bother to specify what we mean more precisely when needed. Sometimes we'll mean one of the three, sometimes two of the three, sometimes all three.

Three Kinds of Competitiveness

I knew that the goal was to get IDA to be cost-competitive, but I thought current versions of it weren't. But that was just my rough impression; glad to be wrong, since it makes IDA seem even more promising. :) Of all the proposals I've heard of, IDA seems to have the best combination of cost, date, and performance-competitiveness.

Three Kinds of Competitiveness

I agree this may be true in most cases, but the chance of it not being true for AI is large enough to motivate the distinction. Besides, not all cases in which performance and cost can be traded off are the same; in some scenarios the "price" of performance is very high whereas in other scenarios it is low. (e.g. in Gradual Economic Takeover, let's say, a system being twice as qualitatively intelligent could be equivalent to being a quarter the price. Whereas in Final Conflict, a system twice as qualitatively intelligent would be equivalent to being one percent the price.) So if we are thinking of a system as "competitive with X% overhead," well, X% is going to vary tremendously depending on which scenario is realized. Seems worth saying e.g. "costs Y% more compute, but is Z% more capable."
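To make the point concrete, here is a minimal sketch with made-up numbers (the 4x and 100x exchange rates are just the illustrative guesses above, and the specific system, 20% more capable at 50% higher cost, is hypothetical): the same "Z% more capable, Y% more compute" system comes out roughly break-even in one scenario and clearly worthwhile in the other, which is why a single "competitive with X% overhead" figure doesn't carry over across scenarios.

```python
import math

# Minimal sketch with made-up numbers: the same "more capable but more expensive"
# system converts into very different effective overheads depending on the
# scenario's exchange rate between qualitative capability and price.

def effective_overhead(capability_ratio, cost_ratio, price_equiv_per_doubling):
    """Return the net cost penalty after crediting the capability advantage.

    capability_ratio:          how many times more capable the safe system is
    cost_ratio:                how many times more it costs to build/run
    price_equiv_per_doubling:  how many times cheaper a rival would need to be
                               to match one doubling of capability (4x in the
                               "Gradual Economic Takeover" guess, 100x in the
                               "Final Conflict" guess)
    A result above 1 means the system is net cost-uncompetitive in that scenario.
    """
    doublings = math.log2(capability_ratio)
    capability_as_price_discount = price_equiv_per_doubling ** doublings
    return cost_ratio / capability_as_price_discount

# Hypothetical system: 20% more capable, 50% more expensive.
for scenario, rate in [("Gradual Economic Takeover", 4), ("Final Conflict", 100)]:
    print(scenario, round(effective_overhead(1.2, 1.5, rate), 2))
# Gradual Economic Takeover 1.04   (roughly break-even)
# Final Conflict 0.45              (well worth the extra cost)
```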

Three Kinds of Competitiveness

Mmm, nice. Thanks! I like your distinction also. I think yours is sufficiently different that we shouldn't see the two sets of distinctions as competing.* A system whose objective would make it capable on paper, but which isn't capable in practice due to inner-alignment failures, would be performance-uncompetitive but objective-competitive. For this reason I think we shouldn't equate objective and performance competitiveness.

If operating an AI system turns out to be an important part of the cost, then cost+date competitiveness would turn out to be different from training competitiveness, because cost competitiveness includes whatever the relevant costs are. However I expect operating costs will be much less relevant to controlling the future than costs incurred during the creation of the system (all that training, data-gathering, infrastructure building, etc.) so I think the mapping between cost+date competitiveness and training competitiveness basically works.

*Insofar as they are competing, I still prefer mine; as you say, it applies to more than just prosaic AI alignment proposals. Moreover, it makes it easier for us to talk about competitions as well, e.g. "In the FOOM scenario we need to win a date competition; cost-competitiveness still matters, but not as much." Moreover, cost, performance, and date are fairly self-explanatory terms, whereas, as you point out, "objective" is more opaque. Moreover, I think it's worth distinguishing between cost and date competitiveness; in some scenarios one will be much more important than the other, and of course the two kinds of competitiveness vary independently across AI safety schemes (indeed, maybe they are mildly anti-correlated? Some schemes are fairly well-defined and codified already but would require tons of compute, whereas other schemes are more vague, and thus would require tons of tweaking and cautious testing to get right, but don't take that much compute). I do like how your version maps more onto the inner vs. outer alignment distinction.

Three Kinds of Competitiveness

Some thoughts that came to me after I wrote this post:

--I'm not sure I should define date-competitive the way I do. Maybe instead of "can be built" it should be "is built." If we go the latter route, the FOOM scenario is an extremely intense date competition. If we go the former route, the FOOM scenario is not necessarily an intense date competition; it depends on what other factors are at play. For example, maybe there are only a few major AI projects and all of them are pretty socially responsible, so a design is more likely to win if it can be built sooner, but it won't necessarily win; maybe cooler heads will prevail and build a safer design instead.

--Why is date-competitiveness worth calling a kind of competitiveness at all? Why not just say: "We want our AI safety scheme/design to be cost- and performance-competitive, and also we need to be able to build it fairly quickly compared to the other stuff that gets built." Well, 1. Even that is clunky and awkward compared to the elegant "...and also date-competitive." 2. It really does have the comparative flavor of competition to it; what matters is not how long it takes us to complete our safety scheme, but how long it takes relative to unaligned schemes, and it's not as simple as just "we need to be first"; rather, sooner is better, but finishing later isn't necessarily game over... 3. It seems to be useful for describing date competitions, which are important to distinguish from situations which are not date competitions, or are less so. (Aside: A classic criticism of the "Let's build uploads first, and upload people we trust" strategy is that neuromorphic AI will probably come before uploads. In other words, this strategy is not date-competitive.)

--I'm toying with the idea of adding "alignment-competitiveness" (meaning, as aligned or more aligned than competing systems) and "alignment competition" to the set of definitions. This sounds silly, but it would be conceptually neat, because then we can say: We hope for scenarios in which control of the future is a very intense alignment competition, and we are working hard to make it that way.

Any rebuttals of Christiano and AI Impacts on takeoff speeds?

Just wanna say, I intend to get around to writing rebuttals someday. I definitely have several counterarguments in mind; the forceful takedowns you mention weren't very convincing to me, though they did make me update away from fast takeoff.

Benito's Shortform Feed

Well, that wasn't the scenario I had in mind. The scenario I had in mind was: People in the year 2030 pass a law requiring future governments to make ancestor simulations with happy afterlives, because that way it's probable that they themselves will be in such a simulation. (It's like cryonics, but cheaper!) Then, hundreds or billions of years later, the future government carries out the plan, as required by law.

Not saying this is what we should do, just saying it's a decision I could sympathize with, and I imagine it's a decision some fraction of people would make, if they thought it was an option.

Benito's Shortform Feed

I'm not sure it makes sense either, but I don't think it is accurately described as "cause yourself to believe false things." I think whether or not it makes sense comes down to decision theory. If you use evidential decision theory, it makes sense; if you use causal decision theory, it doesn't. If you use functional decision theory, or updateless decision theory, I'm not sure, I'd have to think more about it. (My guess is that updateless decision theory would do it insofar as you care more about yourself than others, and functional decision theory wouldn't do it even then.)

What are the most plausible "AI Safety warning shot" scenarios?

Thanks for this reply. Yes, I was talking about intent alignment warning shots. I agree it would be good to consider smaller warning shots that convince, say, 10% of currently-skeptical people. (I think it is too early to say whether COVID-19 is a 50%-warning shot for existential risk from pandemics. If it does end up killing millions, the societal incompetence necessary to get us to that point will be apparent to most people, I think, and thus most people will be on board with more funding for pandemic preparedness even if they would have been "meh" about it before.) If we are looking at 10%-warning shots, smaller-scale things like you are talking about will be more viable.

(Whereas if we are looking at 50%-warning shots, it seems like at least attempting to take over the world is almost necessary, because otherwise skeptics will say "OK yeah so one bad apple embezzled some funds, that's a far cry from taking over the world. Most AIs behave exactly as intended, and no small group of AIs has the ability to take over the world even if it wanted to.")

I'm not imagining that they all want to take over the world. I was just imagining that minor failures wouldn't be sufficiently convincing to count as 50%-warning shots, and it seems you agree with me on that.

Yes, I think it's true of humans: Almost all humans are incapable of getting even close to taking over the world. There may be a few humans who have a decent shot at it and also the motivation and incaution to try it, but they are a very small fraction. And if they were even more competent than they already are, their shot at it would be more than decent. I think the crux of our whole disagreement here was just the thing you identified right away about 50% vs. 10% warning shots. Obviously there are plenty of humans capable and willing to do evil things, and if doing evil things is enough to count as a warning shot, then yeah it's not true of humans, and neither would it be true of AI.

I think you've also pointed out an unfairness in my definition, which was about single events. A series of separate minor events gradually convincing most skeptics is just as good, and now that you mention it, much more likely. I'll focus on these sorts of things from now on, when I think of warning shots.
