Donald Hobson

MMath Cambridge. Currently studying postgrad at Edinburgh.




I think the human level of understanding is a factor, and of some importance. But I strongly suspect the exact level of human understanding matters less than exactly which expert we summon.

Yeah, probably. However, note that it can only use this channel if a human has deliberately made an optimization channel that connects to this process. I.e. the AI isn't allowed to invent DNA printers itself.

I think a bigger flaw is where one human decided to make a channel from A to B, another human made a channel from B to C, and so on, until in total there is a channel from A to Z that no human wants and no human knows exists, built entirely out of parts that humans built.

I.e. person 1 decides the AI should be able to access the internet. Person 2 decides that anyone on the internet should be able to run arbitrary code on their programming website, and the AI puts those together, even though no human did. Is that a failure of this design? Not sure. I can't get a clear picture until I have actual maths.

I am trying to write something that would make sense if I had as solid and mathy an idea of "optimization here" as I do of "information here".

Viruses are optimizing their own spread, not killing all humans. This seems to be further optimizing an already highly optimized artifact, not flowing optimization through an optimized channel. 

I am not sure, I think it depends on why the AI wants the shockwave. Again, all I have is a fuzzy intuition that says yes in some cases, no in others, and shrugs in a lot of cases. I am trying to figure out if I can get this into formal maths. And if I succeed, I will (probably, unless infohazard or something) describe the formal maths. 

Alignment crazy idea. Only run optimization power through channels that have been optimized to convey it.  

Like water that flows through pipes, but doesn't escape from leaks. 

Suppose the AI is connected to a robot body. The AI can optimize along the wires, and through the motors. Optimization power can flow along these channels because humans deliberately optimized them to be good at conveying optimization power. But the AI can't use rowhammer. Humans didn't deliberately optimize memory modules to be susceptible to it; they just happen to be because of physics. Thus the electric interference between memory locations is a channel that optimization power can flow through, but it was not itself optimized to be good at transmitting optimization power. Thus the AI isn't allowed to use it.

An AI with a 50% chance to output an alignment paper in response to a prompt asking for one can, at worst, lose 1 bit of predictive power every time such a prompt appears in the training distribution and isn't followed by a solution.
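As a sanity check on the arithmetic, a minimal sketch (the probabilities here are illustrative, not from any real model):

```python
import math

# Suppose the model assigns probability 0.5 to "an alignment paper
# follows this prompt" and the remaining 0.5 to the continuations that
# actually occur in training. Relative to a model that puts all its
# mass on the training continuations, the extra log-loss per
# occurrence of the prompt is:
p_mass_on_training_continuations = 0.5
extra_bits = -math.log2(p_mass_on_training_continuations)
print(extra_bits)  # 1.0
```

So a 50/50 split costs exactly one bit each time the prompt appears without a solution, which is the "at worst 1 bit" figure above.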

If it really was generalizing well from the training dataset, it might realize that anything claiming to be from the future is fiction. After all, the AI never saw anything from beyond 2023 (or whenever it's trained) in its training dataset.

If the AI has this highly sophisticated world model, it will know those fake newspaper articles were fake. Given the amount of fiction set in the future, adding a little more won't do anything. 

So these scenarios are asking the LLM to develop extremely sophisticated long-term world models and models of future ASI, that are used predictively exactly nowhere in the training dataset and might at best reduce error by 1 bit in obscure circumstances.

So actually, this is about generalization out of training distribution. 


The direction I was thinking is that ChatGPT and similar seem to consist of a huge number of simple rules of thumb that usually work. 

I was thinking of an artifact highly optimized, but not particularly optimizing. A vast collection of shallow rules for translating text to maths queries. 

I was also kind of thinking of asking for known chunks of the problem. Asking it to do work on tasky AI, and logical counterfactuals and transparency tools. Like each individual paper is something MIRI could produce in a year, but you are producing several an hour.

If we are going to do the "random model chooses each token" trick, first use different quantum-random starting weights for each network. Give each network a slightly different selection of layer sizes and training data, and sandbox the networks from each other.
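A minimal sketch of that trick, assuming each independently trained network exposes a hypothetical `next_token` method (all names here are illustrative):

```python
import random

def ensemble_generate(models, prompt, max_tokens, rng=random):
    """Generate a sequence where a freshly chosen random model picks
    each token. `models` would be networks trained with different
    seeds, layer sizes, and data shards, kept sandboxed from each
    other; here they are just objects with a next_token(tokens) method."""
    tokens = list(prompt)
    for _ in range(max_tokens):
        model = rng.choice(models)   # a different random model per token
        tok = model.next_token(tokens)
        if tok is None:              # treat None as end-of-sequence
            break
        tokens.append(tok)
    return tokens
```

The point of re-rolling the model per token (rather than per response) is that no single network controls a whole output.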

Which of the two places are you most worried about containing mesaoptimizers: the language model or the maths model?

If you are getting proofs out of your system, you want to get a formal proof, as well as a human legible proof. (And get a human to read the formal theorem being proved, if not the proof.)
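For illustration, a toy Lean example of that division of labour: the human reads and trusts the theorem statement, while the proof itself is machine-checked (the theorem here is just a placeholder):

```lean
-- The human only needs to read and trust this statement:
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  -- The proof below is checked by the machine, not the human.
  Nat.add_comm a b
```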

Decomposed tasky AIs are pretty useful. Given we don't yet know how to build powerful agents, they are better than nothing. This is entirely consistent with a world where, once agenty AI is developed, it beats the pants off tasky AI.

However, unlike in Chess games, humans can and will use all the tools at their disposal, including many tools (e.g., code-completion engines, optimizers for protein folding, etc..) that are currently classified as “Artificial Intelligence”.


Let's suppose that both the human and the long-term AI have a copy of ChatGPT. However, as many of us have found, ChatGPT is somewhat fickle; it doesn't reliably do what we actually want it to do. We are having short-term, non-catastrophic alignment problems, but they do make the tool significantly less useful.

Does the long-term AI suffer from the same problems? Quite possibly not, if the ChatGPT-like capabilities are sufficiently integrated into the model.

A third task listed is “social manipulation.” Here we must admit we are skeptical. Anyone who has ever tried to convince a dog to part with a bone or a child with a toy could attest to the diminishing returns that an intelligence advantage has in such a situation. 


Try convincing a rock to do something by arguing with it. The rock remains supremely unconvinced. You are much smarter than a rock. 

In order to be convinced to do something, there needs to be sufficiently complex structure to be capable of being convinced. This is the same reason that sophisticated malware can't run on simple analogue circuits.

Dogs aren't capable of being motivated by sophisticated philosophical arguments. 

Of course, humans can get dogs to do all sorts of things through years of training. 

Added to that, a human trying to part a dog from a bone isn't exactly applying the full intellectual power humanity can bring to bear. It isn't like a company doing statistics to optimize ad click-through.

Also, many of the fastest ways to get a small child to give up a toy might count as child abuse, and are therefore not options that naturally spring to mind. (I.e. spinning a tale of terrifying monsters in the toy: that will get the child to drop the toy, run screaming, and have nightmares for weeks.)

The “loss of control” scenario posits a second phase transition, whereby once AI systems become more powerful, they would not merely enable humans to achieve more objectives quicker but would themselves become as qualitatively superior to humans as humans are to other animals.

I think you are imagining the first blue line, and asking the dotted blue line to justify its increased complexity penalty. Meanwhile, other people are imagining the orange line. 
