AI ALIGNMENT FORUM
The Ancient Geek

About Me

Scientist by training, coder by previous session, philosopher by inclination, musician against public demand.

https://theancientgeek.substack.com/?utm_source=substack&utm_medium=web&utm_campaign=substack_profile

Why I am not a Doomer

I'm specifically addressing the argument for a high probability of near extinction (doom) from AI...

Eliezer Yudkowsky: "Many researchers steeped in these issues, including myself, expect that the most likely result of building a superhumanly smart AI, under anything remotely like the current circumstances, is that literally everyone on Earth will die. "

...not whether it is barely possible, or whether other, less bad outcomes (dystopias) are probable. I'm coming from the centre, not the other extreme.


Doom, the complete or almost complete extinction of humanity, requires a less-than-superintelligent AI to become superintelligent either very fast, or very surreptitiously, even though it is starting from a point where it does not have the resources to do either.

The "very fast" version is foom doom...Foom is rapid recursive  self improvement (FOOM is supposed to represent a nuclear explosion)

The classic Foom Doom argument (https://www.greaterwrong.com/posts/kgb58RL88YChkkBNf/the-problem) involves an agentive AI that quickly becomes powerful through recursive self-improvement, and has a value/goal system that is unfriendly and incorrigible.


The complete argument for Foom Doom is that:-

  • The AI will have goals/values in the first place (it won't be a passive tool like GPT*).

  • The values will be misaligned, however subtly, in ways unfavorable to humanity.

  • That the  misalignment cannot be detected or corrected.

  • That the AI can achieve value stability under self modification.

  • That the AI will self-modify in a way too fast to stop.

  • That most misaligned values in the resulting ASI are highly dangerous (even goals that aren't directly inimical to humans can be a problem for humans, because the ASI might want to direct resources away from humans).

  • And that the AI will have extensive opportunities to wreak havoc: biological warfare (custom DNA can be ordered by email), crashing economic systems (trading can be done online), taking over weapons systems, weaponising other technology, and so on.

It’s a conjunction of six or seven claims, not just one. (I say "complete argument" because pro-doomers almost always leave out some stages. I am not convinced that rapid self-improvement and incorrigibility are both needed, but I am sure that one or the other is. Doomers need to reject the idea that misalignment can be fixed gradually, as you go along. A very fast-growing ASI, foom, is one way of doing that; the assumption that AIs will resist having their goals changed is another.)

Obviously, the problem is that to claim a high overall probability of doom, each claim in the chain needs to have a high probability. It is not enough for some of the stages to be highly probable; all must be.
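
To see why the conjunction matters, here is a purely illustrative bit of arithmetic (the step probabilities are hypothetical numbers of my own choosing, and treating the claims as independent is itself a simplification):

```python
# Toy arithmetic for a conjunction of claims, each individually judged "likely".
# The probabilities below are invented for illustration, one per claim above.
step_probabilities = [0.8, 0.8, 0.7, 0.7, 0.7, 0.8, 0.9]

p_doom = 1.0
for p in step_probabilities:
    p_doom *= p  # joint probability, assuming (hypothetically) independence

print(f"Joint probability: {p_doom:.2f}")  # about 0.16, far from a near-certainty
```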


There are some specific weak points.

Goal stability under self-improvement is not a given: it is not possessed by all mental architectures, and may not be possessed by any, since no one knows how to engineer it, and humans appear not to have it.

The Orthogonality Thesis (https://www.lesswrong.com/w/orthogonality-thesis) is sometimes mistakenly called on to support goal stability. It implies that a lot of combinations of goals and intelligence levels are possible, but it doesn't imply that all possible minds have goals, or that all goal-driven agents have fixed, incorrigible goals. There are goalless and corrigible agents in mindspace, too. That's not just an abstract possibility: at the time of writing, 2025, our most advanced AIs, the Large Language Models, are non-agentive and corrigible.

It is plausible that an agent would desire to preserve its goals, but the desire to preserve goals does not imply the ability to preserve them. No goal-stable system of any complexity is known to exist on this planet, and goal stability cannot be assumed as a default or a given. So the Orthogonality Thesis is true of momentary combinations of goal and intelligence, given the provisos above, but not necessarily true of stable combinations.

Another thing that doesn't prove incorrigibility or goal stability is von Neumann rationality. Frequently appealed to in MIRI's early writings, it is an idealised framework for thinking about rationality that doesn't apply to humans, and therefore doesn't have to apply to any given mind.


There are arguments that AIs will become agentive because that's what humans want. Gwern Branwen's confusingly titled "Why Tool AIs Want to Be Agent AIs" (https://gwern.net/tool-ai) is an example. This is true, but in more than one sense:-

The basic idea is that humans want agentive AIs because they are more powerful. And people want power, but not at the expense of control. Power that you can't control is no good to you. Taking the brakes off a car makes it more powerful, but more likely to kill you. No army wants a weapon that will kill its own soldiers, no financial organisation wants a trading system that makes money for someone else, or gives it away to charity, or causes stock market crashes. The maximum amount of power and the minimum of control is an explosion.

One needs to look askance at what "agent" means as well. Among other things, it means an entity that acts on behalf of a human -- as in principal/agent (https://en.m.wikipedia.org/wiki/Principal–agent_problem). An agent is no good to its principal unless it has a good enough idea of its principal's goals. So while people will want agents, they won't want misaligned ones -- misaligned with themselves, that is. Like the Orthogonality Thesis, the argument is not entirely bad news.

Of course, evil governments and corporations controlling obedient superintelligences isn't a particularly optimistic scenario, but it's dystopia, not doom.


Yudkowsky's much-repeated argument that safe, well-aligned behaviour is a small target to hit could actually be two arguments.

One would be the random potshot version of the Orthogonality Thesis, where there is an even chance of hitting any mind, and therefore a high chance of hitting an eldritch, alien mind. But equiprobability is only one way of turning possibilities into probabilities, and not a particularly realistic one. Random potshots aren't analogous to the deliberate act of building a certain type of AI; we don't pick a mind without knowing much about what it will be.

While many of the minds in mindspace are indeed weird and unfriendly to humans, that does not make it likely that the AIs we actually construct will be: we are deliberately seeking to build certain kinds of mind, for one thing, and we have certain limitations, for another. Current LLMs are trained on vast corpora of human-generated content, and inevitably pick up a version of human values from them.
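
As a toy illustration of the point about probability measures (everything here is my own invented stand-in: the one-dimensional "friendliness score", both sampling distributions, and the threshold; none of it models real training dynamics):

```python
import random

random.seed(0)

# Summarise each hypothetical "mind" by a single friendliness score in [0, 1].

def random_potshot():
    """The 'random potshot' picture: sample a mind uniformly from mindspace."""
    return random.random()

def deliberately_built_mind():
    """A stand-in for a construction process that concentrates probability
    mass near human-like values (an assumption made purely for illustration)."""
    return min(1.0, max(0.0, random.gauss(0.8, 0.1)))

def p_unfriendly(sampler, threshold=0.5, trials=100_000):
    """Estimate the chance of drawing a mind below the friendliness threshold."""
    return sum(sampler() < threshold for _ in range(trials)) / trials

print("P(unfriendly | random potshot):    ", p_unfriendly(random_potshot))          # about 0.5
print("P(unfriendly | deliberately built):", p_unfriendly(deliberately_built_mind)) # close to 0
```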

Another interpretation of the Small Target Argument is, again, based on incorrigibility. Corrigibility means you can tweak an AI's goals gradually, as you go along, so there's no need to get them exactly right on the first try.

Posts

TAG's Shortform (5y)

Comments
With enough knowledge, any conscious agent acts morally
TAG · 1mo

What type of argument is my argument, from your perspective?

Naturalistic, intrinsically motivating, moral realism.

both can affect action, as it happens in the thought experiment in the post.

Bad-for-others can obviously affect action in an agent that's already altruistic, but you are attempting something much harder, which is bootstrapping altruistic morality from logic and evidence.

more generally, I think morality is about what is important, better/worse, worth doing, worth guiding action

In some objective sense. If torturing an AI only teaches it to avoid things that are bad-for-it, without caring about suffering it doesn't feel, the argument doesn't work.

(My shoulder Yudkowsky is saying "it would exterminate all others agents in order to avoid being tortured again")

If it only learns a self-centered lesson, it hasn't learned morality in your sense, because you've built altruism into your definition of morality. And why wouldn't it learn the self-centered lesson? That's where the ambiguity of "bad" comes in. Anyone can agree that the AI would learn that suffering is bad in some sense, and you just assume it's going to be the sense needed to make the argument work.

which is not necessarily tied to obligations or motivation.

If the AI learns morality as a theory, but doesn't care to act on it, little has been achieved.

With enough knowledge, any conscious agent acts morally
TAG · 1mo

Let’s state it distinctly: I think that valenced experience has the property of seeming to matter more than other things to rational conscious agents with valenced perceptions;

This type of argument has the problem that other people's negative experiences aren't directly motivating in the way that yours are...there's a gap between bad-for-me and morally-wrong.

To say that something is morally-wrong is to say that I have some obligation or motivation to do something about it.

A large part of the problem is that the words "bad" and "good" are so ambiguous. For instance, they have aesthetic meanings as well as ethical ones. That allows you to write an argument that appears to derive a normative claim from a descriptive one.

See

https://www.lesswrong.com/posts/HLJGabZ6siFHoC6Nh/sam-harris-and-the-is-ought-gap

“Endgame safety” for AGI
TAG · 3y

Of course, Hawkins doesn't just say they are stupid. It is Byrnes who is summarily dismissing Hawkins, in fact.

On corrigibility and its basin
TAG · 3y

Corrigibility has various slightly different definitions, but the general rough idea is of an AI that does what we want

An aligned AI will also do what we want because it's also what it wants; its terminal values are also ours.

I've always taken "control" to differ from alignment in that it means an AI doing what we want even if it isn't what it wants, ie it has a terminal value of getting rewards, and our values are instrumental to that, if they figure at all.

And I take corrigibility to mean shaping an AI's values as you go along, and therefore to be an outcome of control.

Decision Theory
TAG · 4y

More generally, the problem is that for formal agents, false antecedents cause nonsensical reasoning

No, it's contradictory assumptions. False but consistent assumptions are dual to consistent-and-true assumptions...so you can only infer a mutually consistent set of propositions from either.

To put it another way, a formal system has no way of knowing what would be true or false for reasons outside itself, so it has no way of reacting to a merely false statement. But a contradiction is definable within a formal system.

To put it yet another way: contradiction in, contradiction out.
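
A minimal Lean sketch of that point (my own illustration, not from the original discussion): a pair of contradictory premises lets a formal system derive anything at all, whereas a merely false-but-consistent premise carries no such licence.

```lean
-- Principle of explosion: from contradictory premises P and ¬P, any Q follows.
example (P Q : Prop) (hp : P) (hnp : ¬P) : Q := absurd hp hnp

-- There is no analogous derivation from a premise that is merely false "outside"
-- the system: the system cannot see that falsity, so nothing special follows.
```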

Repeated (and improved) Sleeping Beauty problem
TAG · 7y

This statement of the problem concedes that SB is calculating subjective probability. It should be obvious that subjective probabilities can diverge from each other and from objective probability -- that is what subjective means. It seems to me that the SB paradox is only a paradox if you try to do justice to objective and subjective probability in the same calculation.
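
A minimal simulation of that divergence, as a sketch under the standard protocol (one awakening on heads, two on tails; the code and numbers are my own illustration):

```python
import random

random.seed(0)

experiments = 100_000
heads_experiments = 0
heads_awakenings = 0
total_awakenings = 0

for _ in range(experiments):
    heads = random.random() < 0.5      # fair coin toss
    awakenings = 1 if heads else 2     # Beauty is woken once on heads, twice on tails
    total_awakenings += awakenings
    if heads:
        heads_experiments += 1
        heads_awakenings += awakenings

# "Objective" per-experiment frequency vs. "subjective" per-awakening frequency.
print("Heads per experiment:", heads_experiments / experiments)      # about 1/2
print("Heads per awakening: ", heads_awakenings / total_awakenings)  # about 1/3
```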
