Erick Ball

I am currently a nuclear engineer with a focus on nuclear plant safety and probabilistic risk assessment. I am also an aspiring EA, interested in X-risk mitigation and the intersection of science and policy.

Comments

Suppose [...] you’ve got this AI system with this really, really good intelligence, which maybe we’ll call it a world model or just general intelligence. And this intelligence can take in any utility function, and optimize it, and you plug in the incorrect utility function, and catastrophe happens.

I've seen various people argue that this is not how AI works and not how AGI will work--it's basically the old "tool AI" vs "agent AI" debate. But I think the only reason current AI doesn't work this way is that we can't yet make it do so: the default customer requirement for a general intelligence is that it should be able to do whatever task the user asks of it.

So far, the ability of AI to understand a request is very limited (poor natural language skills). But once you have an agent that can understand what you're asking, of course you would design it to optimize new objectives on request, constrained by some built-in rules about not committing crimes or manipulating people or seizing control of the world (easy, I assume). Otherwise you'd need to build a new system for every type of goal, and that's basically just narrow AI.
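To make the "plug in a utility function" picture concrete, here is a minimal sketch in Python (entirely hypothetical; the planner, action menu, and forbidden-action filter are my own illustration, not any existing system's API): one general planner that accepts whatever objective the user supplies and optimizes it over a fixed menu of actions, with a crude hard-coded filter standing in for the built-in rules.

```python
from typing import Callable, Iterable

# Hypothetical sketch only; the names and the "forbidden" list are mine,
# not taken from any real system.

Action = str
UtilityFn = Callable[[Action], float]

# Stand-in for the built-in rules mentioned above.
FORBIDDEN = {"seize control of the world", "manipulate the user"}


def plan(candidate_actions: Iterable[Action], utility: UtilityFn) -> Action:
    """Return the permitted action that the plugged-in utility function rates highest."""
    allowed = [a for a in candidate_actions if a not in FORBIDDEN]
    if not allowed:
        raise ValueError("no permissible action for this request")
    return max(allowed, key=utility)


if __name__ == "__main__":
    actions = ["write the report", "order spare parts", "seize control of the world"]

    # The same planner serves two different requests by swapping the objective:
    print(plan(actions, lambda a: float(len(a))))                  # longest-description objective
    print(plan(actions, lambda a: 1.0 if "report" in a else 0.0))  # report-writing objective
```

The point of the toy is just that nothing about the planner itself changes between requests; the failure mode in the quote above is what you get when the supplied objective is wrong and the filter turns out not to be "easy" after all.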

If our superintelligent AI is just a bunch of well developed heuristics, it is unlikely that those heuristics will be generatively strategic enough to engage in super-long-term planning

If the heuristics are optimized for "be able to satisfy requests from humans" and those requests sometimes require long-term planning, then the skill will develop. If it's only good at satisfying simple requests that don't require planning, in what sense is it superintelligent?

This may be a dumb question, but how can you asymptotically guarantee human-level intelligence when the world-models have bounded computation time, and the human is a "computable function" that has no such limit? Is it because the number of Turing machines is infinite?

My concern is that since CDT is not reflectively stable, it may have incentives to create non-CDT agents in order to fulfill instrumental goals.
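To spell that out with a toy example (my own illustration, not drawn from any particular paper): put a CDT parent in front of a Newcomb-style game with an accurate predictor, and let it choose which successor agent to deploy before the predictor makes its prediction. Because the deployment choice is causally upstream of the prediction, even straightforward causal reasoning tells the CDT parent to build the one-boxing, non-CDT successor.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical toy model: a CDT parent choosing which successor to deploy
# into a Newcomb-style game. The deployment happens before the predictor
# acts, so the choice is causally upstream of the prediction.


@dataclass
class Successor:
    name: str
    policy: Callable[[], str]  # returns "one-box" or "two-box"


def payoff(choice: str, predicted: str) -> int:
    opaque = 1_000_000 if predicted == "one-box" else 0  # filled only if one-boxing is predicted
    transparent = 1_000
    return opaque + (transparent if choice == "two-box" else 0)


def play(successor: Successor) -> int:
    # An accurate predictor inspects the deterministic policy, so its
    # prediction simply matches what the successor will actually do.
    predicted = successor.policy()
    return payoff(successor.policy(), predicted)


cdt_successor = Successor("CDT", lambda: "two-box")    # two-boxing dominates once the boxes are fixed
one_boxer = Successor("one-boxer", lambda: "one-box")  # a non-CDT policy

# The parent compares the causal consequences of each possible deployment:
results = {s.name: play(s) for s in (cdt_successor, one_boxer)}
print(results)                        # {'CDT': 1000, 'one-boxer': 1000000}
print(max(results, key=results.get))  # the CDT parent prefers to create the non-CDT agent
```

The same structure shows up whenever precommitment is instrumentally useful: the CDT parent can't bind its own future choices, so it routes around its decision theory by building something that isn't CDT.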