Donald Hobson — AI Alignment Forum

MMath Cambridge. Currently studying postgrad at Edinburgh.

Since evolution found the human brain algorithm, and evolution only does local search, the human brain algorithm must be built out of many innovations that are individually useful. So we shouldn't expect the human brain algorithm to be an all-or-nothing affair.

If humans are looking at parts of the human brain and copying them, then it's quite possible that the last component we look at is the critical piece that nothing else works without. A modern steam engine was developed step by step from simpler and cruder machines. But if you take apart a modern steam engine and copy each piece, it's likely that it won't work at all until you add the final piece, whichever order you recreate the pieces in.

It's also possible that rat brains have all the fundamental insights. To get from rats to humans, evolution needed to produce lots of genetic code that grew extra blood vessels to supply the oxygen and that prevented brain cancer. (Also, evolution needed to spend time on alignment.) A human researcher can just change one number, and maybe buy some more GPUs.

One thing I disagree with is the idea that there is only one "next paradigm AI" with specific properties.

I think there is a wide spectrum of next-paradigm AIs, some safer than others. Brain-like AIs are just one option out of a large possibility space.

And if the AI is really brainlike, that suggests making an AI that's altruistic for the same reason some humans are. Making a bunch of IQ 160, 95th percentile kindness humans, and basically handing the world over to them sounds like a pretty decent plan.

But they still involve some AI having a DSA at some point. So they still involve a giant terrifying single point of failure.

A single point of failure also means a single point of success.

It could be much worse. We could have 100s of points of failure, and if anything goes wrong at any of those points, we are doomed.

It includes chips that have neither been already hacked into, nor secured, nor had their rental price massively bid upwards. It includes brainwashable humans who have neither been already brainwashed, nor been defended against further brainwashing.

We are already seeing problems with ChatGPT-induced psychosis, and seeing LLMs that kinda hack a bit.

What does the world look like if it is saturated with moderately competent hacking/phishing/brainwashing LLMs? Yes, a total mess. But a mess with less free energy perhaps? Especially if humans have developed some better defenses. Probably still a lot of free energy, but less.

These posts are mainly exploring my disagreement with a group of researchers who think of LLMs as being on a smooth, continuous path towards ASI.

I'm not sure exactly what it means for LLMs to be on a "continuous path towards ASI".

I'm pretty sure that LLMs aren't the pinnacle of possible mind design.

So the question is: will better architectures be invented by a human or by an LLM, and how scaled up will the LLM be when this happens?

We really fully believe that we will build AGI by 2027, and we will enact your plan, but we aren’t willing to take more than a 3-month delay

Well, I ask what they are doing to make AGI.

Maybe I look at their AI plan and go "eureka".

But if not.

Negative reinforcement by giving the AI large electric shocks when it gives a wrong answer. Hopefully big enough shocks to set the whole data center on fire. Implement a free bar for all their programmers, and encourage them to code while drunk. Add as many inscrutable bugs to the codebase as possible.

But, taking the question in the spirit it's meant in:

https://www.lesswrong.com/posts/zrxaihbHCgZpxuDJg/using-llm-s-for-ai-foundation-research-and-the-simple

The Halting problem is a worst case result. Most agents aren't maximally ambiguous about whether or not they halt. And those that are, well then it depends what the rules are for agents that don't halt.

There are setups where each agent is using a nonphysically large but finite amount of compute. There was a paper I saw somewhere a while ago where both agents were doing a brute force proof search for the statement "if I cooperate, then they cooperate" and cooperating if they found a proof.

(I.e., searching all proofs containing fewer than 10^100 symbols.)
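A minimal sketch of the bounded-compute idea, in Python. This is not the proof search from that paper: it swaps in plain bounded simulation, where each agent may simulate its opponent only while a budget (`fuel`) remains, so every call is guaranteed to halt. All names here (`mirror_bot`, `fuel`, `play`) are hypothetical, and the rule for what happens when the budget runs out is an assumption that has to be chosen by whoever sets up the game.

```python
from typing import Callable, Tuple

C, D = "C", "D"

# An agent takes its opponent and a remaining simulation budget ("fuel").
Agent = Callable[["Agent", int], str]


def defect_bot(opponent: Agent, fuel: int) -> str:
    return D


def cooperate_bot(opponent: Agent, fuel: int) -> str:
    return C


def mirror_bot(opponent: Agent, fuel: int) -> str:
    # Rule for running out of budget: an assumption, here an optimistic default.
    if fuel <= 0:
        return C
    # Simulate the opponent playing against this agent with one less unit of fuel.
    return C if opponent(mirror_bot, fuel - 1) == C else D


def play(a: Agent, b: Agent, fuel: int = 10) -> Tuple[str, str]:
    # Each agent gets the same budget and sees the other as its opponent.
    return a(b, fuel), b(a, fuel)
```

With the optimistic default, two copies of `mirror_bot` cooperate with each other, while `mirror_bot` still defects against `defect_bot`. Switching the out-of-budget default to `D` makes two mirror bots defect against each other, which is one concrete way the outcome "depends what the rules are".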

There is a model of bounded rationality, logical induction.

Can that be used to handle logical counterfactuals?

I believe that if I choose to cooperate, my twin will choose to cooperate with probability p; and if I choose to defect, my twin will defect with probability q;

And here the main difficulty pops up again. There is no causal connection between your choice and their choice. Any correlation is a logical one. So imagine I make a copy of you. But the copying machine isn't perfect. A random 0.001% of neurons are deleted. Also, you know you aren't a copy. How would you calculate the probabilities p and q? Even in principle.
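For reference, the arithmetic once p and q are handed to you is the easy part. A small Python sketch, assuming the conventional prisoner's dilemma payoffs T=5, R=3, P=1, S=0 (these numbers and the function names are illustrative assumptions, not from the quoted proposal). It says nothing about where p and q come from, which is the actual difficulty.

```python
def expected_utilities(p, q, T=5.0, R=3.0, P=1.0, S=0.0):
    """Expected payoff of each action, given the quoted beliefs.

    p: probability the twin cooperates if I cooperate.
    q: probability the twin defects if I defect.
    T, R, P, S: temptation, reward, punishment, sucker payoffs (assumed values).
    """
    eu_cooperate = p * R + (1 - p) * S
    eu_defect = q * P + (1 - q) * T
    return eu_cooperate, eu_defect


def best_action(p, q, **payoffs):
    # Cooperate only if it has strictly higher expected payoff.
    eu_c, eu_d = expected_utilities(p, q, **payoffs)
    return "C" if eu_c > eu_d else "D"
```

With a perfect copy (p = q = 1) this gives cooperation; with no correlation worth the name (p = q = 0.5) it gives defection. Everything interesting is in which of those regimes an imperfect copy falls into.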

If two Logical Decision Theory agents with perfect knowledge of each other's source code play the prisoner's dilemma, theoretically they should cooperate.

LDT uses logical counterfactuals in the decision making.

If the agents are CDT, then logical counterfactuals are not involved.
