Donald Hobson

MMath Cambridge. Currently studying postgrad at Edinburgh.

On corrigibility and its basin

Sure, an AI that ignores what you ask and implements some form of CEV or whatever isn't corrigible. Corrigibility is more about following instructions than about sharing your utility function.

Godzilla Strategies

You seem to believe that any plan involving what you call "godzilla strategies" is brittle. This is something I am not confident in. Someone may yet find such a strategy that can be shown not to be brittle.

Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc

Any network big enough to be interesting is big enough that the programmers don't have the time to write decorative labels. If you had some algorithm that magically produced a Bayes net with a billion intermediate nodes that accurately did some task, it would be just as obvious a black box. No one will have come up with a list of a billion decorative labels.
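To illustrate (a toy sketch of my own, not from the post): any structure-learning algorithm emits machine-generated node IDs, and a meaningful label would have to be added by hand afterwards.

```python
import random

random.seed(0)

def learn_structure(n_nodes, n_edges):
    """Stand-in for a magical structure-learning algorithm: returns a
    random DAG. The point: the nodes it emits have IDs, not meanings."""
    edges = set()
    while len(edges) < n_edges:
        a, b = sorted(random.sample(range(n_nodes), 2))
        # every edge goes from lower to higher index, so the graph is acyclic
        edges.add((f"node_{a}", f"node_{b}"))
    return edges

net = learn_structure(n_nodes=1_000, n_edges=3_000)
print(sorted(net)[:3])
# Nothing here says "is_a_dog" or "edge_detector": interpreting node_417
# is a research project, just as for a neuron in a deep network.
```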

autonomy: the missing AGI ingredient?

Retain or forget information and skills over long time scales, in a way that serves its goals.  E.g. if it does forget some things, these should be things that are unusually unlikely to come in handy later.


If memory is cheap, designing it to just remember everything may be a good idea. And there may be some architectural reason why choosing to forget things is hard.

Bits of Optimization Can Only Be Lost Over A Distance

Suppose I send a few lines of code to a remote server. Those lines are enough to bootstrap a superintelligence, which goes on to strongly optimize every aspect of the world. Yet this counts as a fairly small amount of optimization power on my part, because the chance of landing on those lines of code by pure chance isn't that small.
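"Fairly small" can be made quantitative. On the convention that optimization power is the negative log-probability of the outcome under a base distribution, a snippet of n characters drawn from an alphabet of k symbols carries at most n·log2(k) bits. A back-of-envelope sketch (the 95-symbol printable-ASCII alphabet and the 500-character length are my illustrative assumptions):

```python
import math

def max_bits(n_chars: int, alphabet_size: int = 95) -> float:
    """Upper bound on the optimization power (in bits) a text snippet
    can carry, if the base measure is uniform over printable-ASCII
    strings of that length."""
    return n_chars * math.log2(alphabet_size)

# A "few lines of code" -- say 500 characters -- is at most:
print(max_bits(500))  # ≈ 3285 bits
```

A few thousand bits is tiny next to the optimization the bootstrapped superintelligence then applies to the world, which is the tension the comment is pointing at.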

Why Agent Foundations? An Overly Abstract Explanation

But what if we instead design the system so that the leaked radio signal has zero mutual information with whatever signals are passed around inside the system? Then it doesn’t matter how much optimization pressure an adversary applies, they’re not going to figure out anything about those internal signals via leaked radio.

Flat out wrong. It's quite possible for A and B to have zero mutual information, yet A and B always have mutual information conditional on some C (assuming A and B each carry information). It's possible for there to be absolutely no mutual information between any 2 of electricity use, leaked radio, and private key, yet for there to be mutual information among all 3. So if the adversary knows your electricity use and can detect the leaked radio, they know the key.
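The XOR construction makes this concrete. A self-contained sketch (variable names mine): two independent coin-flip streams stand in for electricity use and the private key, and the leaked radio signal is their XOR. Every pair has essentially zero mutual information, yet any two streams determine the third.

```python
import math
import random

def mutual_information(pairs):
    """Empirical mutual information (in bits) between two symbol streams."""
    n = len(pairs)
    pxy, px, py = {}, {}, {}
    for x, y in pairs:
        pxy[(x, y)] = pxy.get((x, y), 0) + 1 / n
        px[x] = px.get(x, 0) + 1 / n
        py[y] = py.get(y, 0) + 1 / n
    return sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in pxy.items())

random.seed(0)
# electricity use and the private key are independent fair coin flips;
# the leaked radio signal is their XOR
electricity = [random.randint(0, 1) for _ in range(100_000)]
key = [random.randint(0, 1) for _ in range(100_000)]
radio = [e ^ k for e, k in zip(electricity, key)]

# each pair has (near-)zero mutual information...
print(mutual_information(list(zip(electricity, radio))))  # ~0 bits
print(mutual_information(list(zip(key, radio))))          # ~0 bits
# ...but electricity and radio together pin down the key exactly:
print(all(e ^ r == k for e, r, k in zip(electricity, radio, key)))  # True
```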

Against Time in Agent Models

This fails if there are closed timelike curves around. 

There is of course a very general formalism, whereby inputs and outputs are combined into "aputs". Physical laws of causality, and restrictions like running on a reversible computer, are just restrictions on the subsets of aputs accepted.
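The aput framing can be sketched concretely (a toy sketch; the NAND/CNOT devices and the `is_reversible` predicate are my illustrative choices, not from the post): a device is just the set of joint variable-assignments it accepts, with no built-in input/output split, and a physical restriction is a predicate on that set.

```python
from itertools import product

# An "aput" is one joint assignment to all of a device's variables.
# A device is simply the set of aputs it accepts.

# A NAND gate as a set of accepted aputs over (a, b, out):
nand = {(a, b, 1 - (a & b)) for a, b in product((0, 1), repeat=2)}

# Extra physical laws are extra restrictions on the accepted set.
# E.g. reversibility: either half of an aput determines the other half
# uniquely (the accepted set is the graph of a bijection).
def is_reversible(device, split):
    ins = [aput[:split] for aput in device]
    outs = [aput[split:] for aput in device]
    return len(set(ins)) == len(device) == len(set(outs))

print(is_reversible(nand, 2))  # False: NAND loses information
cnot = {(a, b, a, a ^ b) for a, b in product((0, 1), repeat=2)}
print(is_reversible(cnot, 2))  # True: CNOT is reversible
```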

Frame for Take-Off Speeds to inform compute governance & scaling alignment

Well, COVID was pretty much a massive, obvious biorisk disaster. Did it lead to huge amounts of competence and resources being put into pandemic prevention?

My impression is not really. 

I mean I also expect an AI accident that kills a similar number of people to be pretty unlikely. But

Law-Following AI 1: Sequence Introduction and Structure

I think we have a very long track record of embedding our values into law.

I mean, you could say that if we haven't figured out how to do it well in the last 10,000 years, maybe don't plan on doing it in the next 10. That's kind of mean, though.

If you have a functioning arbitration process, can't you just say "don't do bad things" and leave everything down to the arbitration?

I also kind of feel that adding laws goes in the direction of more complexity, and we really want as simple as possible. (I.e. the minimal AI that can sit in a MIRI basement and help them figure out the rest of AI theory, or something.)

If the human still wants to proceed, they can try to:

I was talking about a scenario where the human has never imagined the possibility, and asking whether the AI mentions the possibility to the human (knowing the human may change the law to get it).

The human says "cure my cancer". The AI reasons that it can either:

  1. Tell the human of a drug that cures their cancer in the conventional sense.
  2. Tell the human about mind uploading, never mentioning the chemical cure.

If the AI picks 2, the human will change the "law" (which isn't the actual law, it's just some text file the AI wants to obey). Then the AI can upload the human, and the human will have a life the AI judges as overall better for them.

You don't want the AI to never mention a really good idea because it happens to be illegal on a technicality. You also don't want all the plans to be "persuade humans to make everything legal, then ..." 

Law-Following AI 1: Sequence Introduction and Structure

First, I explicitly define LFAI to be about compliance with "some defined set of human-originating rules ('laws')." I do not argue that AI should follow all laws, which does indeed seem both hard and unnecessary.

Sure, some of the failure modes mentioned at the bottom disappear when you do that.

A stronger claim, I think: that the set of laws worth having AI follow is so small or unimportant as to not be worth trying to follow. That seems unlikely.

If some law is so obviously a good idea in all possible circumstances, the AI will follow it whether it is law-following or human-preference-following.

The question isn't whether there are laws that are better than nothing. It's whether we are better off encoding what we want the AI to do into laws, or into the terms of a utility function. Which format (or maybe some other format) is best for encoding our preferences?


If their objective function is something like the CEV of humanity, any extra laws imposed on top of that are entropic. 

But I would bet that most people, probably including you, would rather live among agents that scrupulously followed the law than agents who paid it no heed and simply pursued their objective functions.

If the AIs' objectives have no correlation with human wellbeing, the weak correlation provided by law-following may be better than nothing. If the AI's objective is already strongly correlated with human wellbeing, then any laws imposed on top are making the AI worse.
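This tradeoff shows up in a toy model (entirely my own sketch: "whim" is an objective component unrelated to welfare, and the "law" is an imperfect proxy that forbids half the actions):

```python
import random

random.seed(1)

def expected_welfare(corr, law, n_actions=200, trials=1000):
    """Toy model. The AI picks the action maximizing its objective,
    objective = corr * welfare + (1 - corr) * whim,
    where 'whim' is unrelated to human welfare. The (imperfect) law
    forbids the half of actions scoring lowest on a noisy welfare proxy."""
    total = 0.0
    for _ in range(trials):
        welfare = [random.random() for _ in range(n_actions)]
        whim = [random.random() for _ in range(n_actions)]
        proxy = [0.3 * w + 0.7 * random.random() for w in welfare]
        cutoff = sorted(proxy)[n_actions // 2]
        lawful = [i for i in range(n_actions)
                  if not law or proxy[i] >= cutoff]
        best = max(lawful, key=lambda i: corr * welfare[i] + (1 - corr) * whim[i])
        total += welfare[best]
    return total / trials

# Objective uncorrelated with welfare: even a crude law helps.
print(expected_welfare(corr=0.0, law=False))  # ~0.50
print(expected_welfare(corr=0.0, law=True))   # noticeably higher
# Objective fully aligned with welfare: the law only removes options.
print(expected_welfare(corr=1.0, law=False))  # ~0.995
print(expected_welfare(corr=1.0, law=True))   # slightly lower
```

In this sketch the law is pure gain when the objective is uncorrelated with welfare, and a small pure loss when the objective already is welfare, matching the claim above.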

If the AI's estimation in either direction makes a mistake as to what humans' "true" preferences regarding X are, then the humans can decide to change the rules. The law is dynamic, and therefore the deliberative processes that shape it would/could shape an LFAI's constraints.

If the human has never imagined mind uploading, does the AI go up to the human and explain what it is, asking if maybe that law should be changed?
