All of Donald Hobson's Comments + Replies

On corrigibility and its basin

Sure, an AI that ignores what you ask and implements some form of CEV or whatever isn't corrigible. Corrigibility is more about following instructions than about having your utility function.

Godzilla Strategies

You seem to believe that any plan involving what you call "godzilla strategies" is brittle. This is something I am not confident in. Someone may find some strategy that can be shown not to be brittle.

What I would actually claim is roughly:
* Godzilla plans are brittle by default.
* In order for the plan to become not-brittle, some part of it other than the use-Godzilla-to-fight-Mega-Godzilla part has to "do the hard part" of alignment.

You could probably bolt a Godzilla-vs-Mega-Godzilla mechanism onto a plan which already solved the hard parts of alignment via some other strategy, and end up with a viable plan.
Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc

Any network big enough to be interesting is big enough that the programmers don't have the time to write decorative labels. If you had some algorithm that magically produced a Bayes net with a billion intermediate nodes that accurately did some task, then it would also be an obvious black box. No one will have come up with a list of a billion decorative labels.

autonomy: the missing AGI ingredient?

Retain or forget information and skills over long time scales, in a way that serves its goals.  E.g. if it does forget some things, these should be things that are unusually unlikely to come in handy later.


If memory is cheap, designing it to just remember everything may be a good idea. And there may be some architectural reason why choosing to forget things is hard.

Bits of Optimization Can Only Be Lost Over A Distance

Suppose I send a few lines of code to a remote server. Those lines are enough to bootstrap a superintelligence which goes on to strongly optimize every aspect of the world. This counts as a fairly small amount of optimization power, as the probability of my landing on those lines of code by chance isn't that small.

Correct, though note that this doesn't let you pick a specific optimization target for that AGI. You'd need more bits in order to specify an optimization target. In other words, there's still a meaningful sense in which you can't "steer" the world into a specific target other than one it was likely to hit anyway, at least not without more bits.
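As a toy illustration of this accounting (hypothetical numbers, not from the original exchange): optimization power can be scored as -log2 of the probability of an outcome at least that good arising by chance, and steering the resulting AGI to one specific target out of many then costs additional bits on top of the bits needed to get an AGI at all.

```python
from math import log2

# Hypothetical numbers, purely to illustrate the bit accounting above.
# Optimization power ~ -log2(probability of an outcome at least this
# good arising by chance).

def bits(p):
    """Bits of optimization implied by hitting an event of probability p."""
    return -log2(p)

# Landing on *some* code that bootstraps a superintelligence:
# suppose 1 in 2**300 random short strings would work.
print(bits(2.0 ** -300))                      # 300.0 bits

# Steering that AGI to one *specific* target out of 2**100
# equally-likely-by-default targets costs extra bits.
print(bits(2.0 ** -300) + bits(2.0 ** -100))  # 400.0 bits
```

The point of the toy numbers: getting "a superintelligence, optimizing for something" is cheap in bits relative to getting "a superintelligence optimizing for this particular target".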
Why Agent Foundations? An Overly Abstract Explanation

But what if we instead design the system so that the leaked radio signal has zero mutual information with whatever signals are passed around inside the system? Then it doesn’t matter how much optimization pressure an adversary applies, they’re not going to figure out anything about those internal signals via leaked radio.

Flat out wrong. It's quite possible for A and B to have 0 mutual information. But A and B always have mutual information conditional on some C (assuming A and B each carry information). It's possible for there to be absolutely no mutual i... (read more)
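The claim can be checked numerically in a standard toy case (my example, not from the comment): let A and B be independent fair coins and C = A XOR B. Then I(A;B) = 0, but conditional on C = 0 the two variables are perfectly correlated, so I(A;B|C) = 1 bit.

```python
from itertools import product
from math import log2

def mutual_info(joint):
    """Mutual information (in bits) of a joint distribution.

    joint: dict mapping (a, b) -> probability.
    """
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

# A and B independent fair coins: all four (a, b) outcomes equally likely.
joint = {(a, b): 0.25 for a, b in product([0, 1], repeat=2)}
print(mutual_info(joint))  # 0.0 -- no mutual information

# Condition on C = A XOR B = 0: only (0,0) and (1,1) remain, each prob 1/2.
joint_given_c0 = {(0, 0): 0.5, (1, 1): 0.5}
print(mutual_info(joint_given_c0))  # 1.0 -- fully correlated given C
```

So a "zero mutual information" guarantee about the leaked signal does not survive conditioning on side information an adversary might have.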

Against Time in Agent Models

This fails if there are closed timelike curves around. 

There is of course a very general formalism, whereby inputs and outputs are combined into aputs. Physical laws of causality, and restrictions like running on a reversible computer are just restrictions on the subsets of aputs accepted. 

Frame for Take-Off Speeds to inform compute governance & scaling alignment

Well, COVID was pretty much a massive, obvious biorisk disaster. Did it lead to huge amounts of competence and resources being put into pandemic prevention?

My impression is not really. 

I mean, I also expect an AI accident that kills a similar number of people to be pretty unlikely. But

1Logan Riggs Smith1mo
I wonder how much COVID got people to switch to working on biorisk. What I'm interested in here is talking to real researchers and asking what events would convince them to switch to alignment. Enumerating those would be useful for explaining to them. I think asking for specific capabilities would also be interesting. Or what specific capabilities they would've said in 2012. Then asking how long they expect between that capability and an x-catastrophe.
Law-Following AI 1: Sequence Introduction and Structure

I think we have a very long track record of embedding our values into law.

I mean you could say that if we haven't figured out how to do it well in the last 10,000 years, maybe don't plan on doing it in the next 10. That's kind of being mean though. 

If you have a functioning arbitration process, can't you just say "don't do bad things" and leave everything down to the arbitration?

I also kind of feel that adding laws is going in the direction of more complexity. And we really want something as simple as possible. (I.e. the minimal AI that can sit in a MIRI basement... (read more)

Law-Following AI 1: Sequence Introduction and Structure

First, I explicitly define LFAI to be about compliance with "some defined set of human-originating rules ('laws')." I do not argue that AI should follow all laws, which does indeed seem both hard and unnecessary.

Sure, some of the failure modes mentioned at the bottom disappear when you do that.

I think, a stronger claim: that the set of laws worth having AI follow is so small or unimportant as to be not worth trying to follow. That seems unlikely.

If some law is so obviously a good idea in all possible circumstances, the AI will do it whether it ... (read more)

As explained in the second post, I don't agree that that's implied if the AI is intent-aligned but not aligned with some deeper moral framework like CEV. I agree that that is an important question. I think we have a very long track record of embedding our values into law. The point of this sequence is to argue that we should therefore at a minimum explore pointing to (some subset of) laws, which has a number of benefits relative to trying to integrate values into the utility function objectively. I will defend that idea more fully in a later post, but to briefly motivate the idea, law (as compared to something like the values that would come from CEV) is more or less completely written down, much more agreed-upon, much more formalized, and has built-in processes for resolving ambiguities and contradictions. A cartoon version of this may be that A says "It's not clear whether that's legal, and if it's not legal it would be very bad (murder), so I can't proceed until there's clarification." If the human still wants to proceed, they can try to: 1. Change the law. 2. Get a declaratory judgment that it's not in fact against the law.
Law-Following AI 3: Lawless AI Agents Undermine Stabilizing Agreements

A(Y), with the objective of maximizing the value of Y's shares.

A sphere of self-replicating robots, expanding at the speed of light, turning all available atoms, including the atoms that used to make up judges, courts and shareholders, into endless stacks of banknotes (with little "belongs to Y" notes pinned to them).

Law-Following AI 2: Intent Alignment + Superintelligence → Lawless AI (By Default)

Yes, the superintelligent AI could out-lawyer everyone else in the courtrooms if it wanted to. My background assumption is more that the AI develops nanotech, and can make every lawyer, court and policeman in the world vanish the moment the AI sees fit. The human legal system can only affect the world to the extent that humans listen to it and enforce it. With a superintelligent AI in play, this may well be King Canute commanding the sea to halt (and a servant trying to bail the sea back with a teaspoon).

You don't really seem to be considering the AI ... (read more)

(I realized the second H in that blockquote should be an A)
Law-Following AI 1: Sequence Introduction and Structure

Problem 1)

Human-written laws are written with our current tech level in mind. There are laws against spewing out radio noise on the wrong frequencies, laws about hacking and encryption, laws about the safe handling of radioactive material. There are vast swaths of safety regulations. This is stuff that has been written for current tech levels. (How well would Roman law work today?)

This kind of "law following AI" sounds like it will produce nonsensical results as it tries to build warp drives and dyson spheres to the letter of modern building codes. Followin... (read more)

I appreciate your engagement! But I think your position is mistaken for a few reasons: First, I explicitly define LFAI to be about compliance with "some defined set of human-originating rules ('laws')." I do not argue that AI should follow all laws, which does indeed seem both hard and unnecessary. But I should have been more clear about this. (I did have some clarification in an earlier draft, which I guess I accidentally excised.) So I agree that there should be careful thought about which laws an LFAI should be trained to follow, for the reasons you cite. That question itself could be answered ethically or legally, and could vary with the system for the reasons you cite. But to make this a compelling objection against LFAI, you would have to make, I think, a stronger claim: that the set of laws worth having AI follow is so small or unimportant as to be not worth trying to follow. That seems unlikely. Second, you point to a lot of cases where the law would be underdetermined as to some out-of-distribution (from the caselaw/motivations of the law) action that the AI wanted to do, and say that: But I think LFAI would actually facilitate the result you want, not hinder it: 1. As I say, the pseudocode would first ask whether the act X being contemplated is clearly illegal with reference to the set of laws the LFAI is bound to follow. If it is, then that seems to be some decent (but not conclusive) evidence that there has been a deliberative process that prohibited X. 2. The pseudocode then asks whether X is maybe-illegal. If there has not been deliberation about analogous actions, that would suggest uncertainty, which would weigh in the favor of not-X. If the uncertainty is substantial, that might be decisive against X. 3. If the AI's estimation in either direction makes a mistake as to what humans' "true" preferences regarding X are, then the humans can decide to change the rules. The law is dynamic, and therefore the delibe
“Pivotal Act” Intentions: Negative Consequences and Fallacious Arguments

Imagine a contract. "We the undersigned agree that AGI is a powerful, useful, but also potentially dangerous technology. To help avoid needlessly taking the same risk twice, we agree that upon development of the world's first AI, we will stop all attempts to create our own AI on the request of the first AI or its creators. In return, the creator of the first AI will be nice with it."

Then you aren't stopping all competitors. You're stopping the few people that can't cooperate.

“Pivotal Act” Intentions: Negative Consequences and Fallacious Arguments

That is, if you are in a position where you have the option to build an AI capable of destroying all competing AI projects, the moment you notice this you should update heavily in favor of short timelines (zero in your case, but everyone else should be close behind) and fast takeoff speeds (since your AI has these impressive capabilities). You should also update on existing AI regulation being insufficient (since it was insufficient to prevent you)

A functioning Bayesian should probably have updated to that position long before they actually have the A... (read more)

“Pivotal Act” Intentions: Negative Consequences and Fallacious Arguments

Suppose you develop the first AGI. It fooms. The AI tells you that it is capable of gaining total cosmic power by hacking physics in a millisecond. (Being an aligned AI, it's waiting for your instructions before doing that.) It also tells you that the second AI project is only 1 day behind, and they have screwed up alignment.


  1. Do nothing. Unfriendly AI gains total cosmic power tomorrow.
  2. Lightspeed bubble of hedonium. All humans are uploaded into a virtual utopia by femtobots. The sun is fully disassembled for raw materials within 10 minutes of
... (read more)
Takeoff speeds have a huge effect on what it means to work on AI x-risk

In contrast, in a slow takeoff world, many aspects of the AI alignment problems will already have showed up as alignment problems in non-AGI, non-x-risk-causing systems; in that world, there will be lots of industrial work on various aspects of the alignment problem, and so EAs now should think of themselves as trying to look ahead and figure out which margins of the alignment problem aren’t going to be taken care of by default, and try to figure out how to help out there. 


Let's consider the opposite. Imagine you are programming a self-driving ca... (read more)

How BoMAI Might fail

Consider the strategy "do whatever action you predict to maximize the electricity in this particular piece of wire in your reward circuitry". This is a very general cognitive pattern that would maximize reward in the training runs. Now there are many different cognitive patterns that maximize reward in the training runs. But this is a simple one, so it's at least reasonably plausible that it is used.

What I was thinking when I wrote it was more like. When someone proposes a fancy concrete and vacuum box, they are claiming that the fancy box is doing something. ... (read more)

Late 2021 MIRI Conversations: AMA / Discussion

Like 10s of atoms across. So you aren't scaling down that much. (Most of your performance gains are in being able to stack your chips or whatever.)

Late 2021 MIRI Conversations: AMA / Discussion

One thing in the posts I found surprising was Eliezer's assertion that you needed a dangerous superintelligence to get nanotech. If the AI is expected to do everything itself, including inventing the concept of nanotech, I agree that this is dangerously superintelligent.

However, suppose Alpha Quantum can reliably approximate the behaviour of almost any particle configuration. Not literally any, it can't run a quantum computer factorizing large numbers better than factoring algorithms, but enough to design a nanomachine. (It has been trained to approxi... (read more)

2Vojtech Kovarik3mo
(Not very sure I understood your description right, but here is my take:)
* I think your proposal is not explaining some crucial steps, which are in fact hard. In particular, I understood it as "you have AI which can give you blueprints for nano sized machines". But I think we already have some blueprints, this isn't an issue. How we assemble them is an issue.
* I expect that there will be more issues like this that you would find if you tried writing the plan in more detail.

However, I share the general sentiment behind your post --- I also don't understand why you can't get some pivotal act by combining human intelligence with some narrow AI. I expect that Eliezer has tried to come up with such combinations and came away with some general takeaways on this being not realistic. But I haven't done this exercise, so it seems not obvious to me. Perhaps it would be beneficial if many more people tried doing the exercise and then communicated the takeaways.
1Gram Stone4mo
I got the impression Eliezer's claiming that a dangerous superintelligence is merely sufficient for nanotech. How would you save us with nanotech? It had better be good given all the hardware progress you just caused!
1Matthew "Vaniver" Graves4mo
Uh, how big do you think contemporary chips are?
Possible Dangers of the Unrestricted Value Learners

I think that given good value learning, safety isn't that difficult. I think even a fairly half-hearted attempt at the sort of naive safety measures discussed will probably lead to non-catastrophic outcomes.

Tell it about mindcrime from the start. Give it lots of hard disks, and tell it to store anything that might possibly resemble a human mind. It only needs to work well enough with a bunch of MIRI people guiding it and answering its questions. Post-singularity, a superintelligence can see if there are any human minds in the simulations it created when young and dumb. If there are, welcome those minds to the utopia.

A positive case for how we might succeed at prosaic AI alignment

I think you might be able to design advanced nanosystems without AI doing long term real world optimization. 

Well, a sufficiently large team of smart humans could probably design nanotech. The question is how much an AI could help.

Suppose unlimited compute. You program a simulation of quantum field theory. Add a GUI to see visualizations and move atoms around. Designing nanosystems is already quite a bit easier.

Now suppose you brute force search over all arrangements of 100 atoms within a 1nm box, searching for the configuration that most efficiently t... (read more)

Discussion with Eliezer Yudkowsky on AGI interventions

Under the Eliezerian view (the pessimistic view that produces <10% chances of success), these approaches are basically doomed. (See logistic success curve.)

Now I can't give overwhelming evidence for this position. Wisps of evidence maybe, but not an overwhelming mountain of it.

Under these sorts of assumptions, building a container for an arbitrary superintelligence such that it has only an 80% chance of being immediately lethal, and a 5% chance of being marginally useful, is an achievement.

(and all possible steelmannings, that's a huge space)

Discussion with Eliezer Yudkowsky on AGI interventions

Let's say you use all these filtering tricks. I have no strong intuitions about whether these are actually sufficient to stop those kinds of human manipulation attacks. (Of course, if your computer security isn't flawless, it can hack whatever computer system it's on and bypass all these filters to show the humans arbitrary images, and probably access the internet.)

But maybe you can, at quite significant expense, make a Faraday cage sandbox, and then use these tricks. This is beyond what most companies will do in the name of safety. But MIRI or whoever cou... (read more)

Well, if you restrict yourself to accepting the safe, testable advice, that may still be enough to put you enough years ahead of your competition to develop FAI before they develop AI. My meta-point: These methods may not be foolproof, but if currently it looks like no method is foolproof—if, indeed, you currently expect a <10% chance of success (again, a number I made up from the pessimistic impression I got)—then methods with a 90% chance, a 50% chance, etc. are worthwhile, and furthermore it becomes worth doing the work to refine these methods and estimate their success chances and rank them. Dismissing them all as imperfect is only worthwhile when you think perfection is achievable. (If you have a strong argument that method M and any steelmanning of it has a <1% chance of success, then that's good cause for dismissing it.)
Intelligence or Evolution?

Firstly, this would be AIs looking at their own version of the AI alignment problem. This is not random mutation or anything like it. Secondly, I would expect there to be only a few rounds maximum of self-modification that run a risk to goals (likely 0 rounds). Damaging your goals loses a lot of utility; you would only do it if it's a small change in goals for a big increase in intelligence, and if you really need to be smarter and can't make yourself smarter while preserving your goals.

You don't have millions of AI all with goals different from each other. The self upgrading step happens once before the AI starts to spread across star systems.

Intelligence or Evolution?

Error correction codes exist. They are low-cost in terms of memory etc. Having a significant portion of your descendants mutate and do something you don't want is really bad.

If error-correcting to the point where there is not a single mutation in the future only costs you 0.001% extra resources in hard drives, then <0.001% of resources will be wasted due to mutations.
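A toy calculation (hypothetical numbers, and a deliberately crude code) of how cheaply redundancy suppresses corruption: with a simple n-fold repetition code decoded by majority vote, the decoded bit is wrong only if more than half the copies flip, so the error probability falls off exponentially in n while the storage overhead grows only linearly.

```python
from math import comb

def p_decoded_wrong(n, eps):
    """Probability a majority of n independent copies flip,
    given per-copy flip probability eps."""
    return sum(comb(n, k) * eps**k * (1 - eps)**(n - k)
               for k in range(n // 2 + 1, n + 1))

eps = 1e-6  # hypothetical per-copy corruption probability
for n in (1, 3, 5, 7):
    print(n, p_decoded_wrong(n, eps))
```

Even this naive scheme drives the error rate from ~1e-6 (one copy) to ~3e-12 (three copies) and far lower beyond that; real error-correcting codes do much better per bit of overhead, which is the comment's point about mutation being cheap to eliminate.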

Evolution is kind of stupid compared to superintelligences. Mutations are not going to find improvements, because the superintelligence will be designing its own hardware and the hardwa... (read more)

1Christian Kleineidam8mo
Error correction codes help a superintelligence to avoid self-modifying but they don't allow goals necessarily to be stable with changing reasoning abilities.
Intelligence or Evolution?

Darwinian evolution as such isn't a thing amongst superintelligences. They can and will preserve terminal goals. This means the number of superintelligences running around is bounded by the number humans produce before the first ASI gets powerful enough to stop any new rivals being created. Each AI will want to wipe out its rivals if it can (unless they are managing to cooperate somewhat). I don't think superintelligences would have humans' kind of partial cooperation: either near-perfect cooperation, or near-total competition. So this is a scenario where a smallish number of ASIs that have all foomed in parallel expand as a squabbling mess.

1Ramana Kumar8mo
Do you know of any formal or empirical arguments/evidence for the claim that evolution stops being relevant when there exist sufficiently intelligent entities (my possibly incorrect paraphrase of "Darwinian evolution as such isn't a thing amongst superintelligences")?
How much chess engine progress is about adapting to bigger computers?

I don't think this research, if done, would give you strong information about the field of AI as a whole. 

I think that, of the many topics researched by AI researchers, chess playing is far from the typical case. 

It's [chess] not the most relevant domain to future AI, but it's one with an unusually long history and unusually clear (and consistent) performance metrics.

An unusually long history implies unusually slow progress. There are problems that computers couldn't do at all a few years ago that they can do fairly efficiently now. Are there pro... (read more)

4Paul Christiano1y
Is your prediction that e.g. the behavior of chess will be unrelated to the behavior of SAT solving, or to factoring? Or that "those kinds of things" can be related to each other but not to image classification? Or is your prediction that the "new regime" for chess (now that ML is involved) will look qualitatively different than the old regime? I'm aware of very few examples of that occurring for problems that anyone cared about (i.e. in all such cases we found the breakthroughs before they mattered, not after). Are you aware of any? Factoring algorithms, or primality checking, seem like fine domains to study to me. I'm also interested in those and would be happy to offer similar bounties for similar analyses. I think it's pretty easy to talk about what distinguishes chess, SAT, classification, or factoring from multiplication. And I'm very comfortable predicting that the kind of AI that helps with R&D is more like the first four than like the last (though these things are surely on a spectrum). You may have different intuitions, I think that's fine, in which case this explains part of why this data is more interesting to me than you. Can you point to a domain where increasing R&D led to big insights that improved performance? Perhaps more importantly, machine learning is also "a slow accumulation of little tricks," so the analogy seems fine to me. (You might think that future AI is totally different, which is fine and not something I want to argue about here.) If Alice says this and so never learns about anything, and Bob instead learns a bunch of facts about a bunch of domains, I'm pretty comfortable betting on Bob being more accurate about most topics. I think the general point is: different domains differ from one another. You want to learn about a bunch of them and see what's going on, in order to reason about a new domain. I agree with the basic point that board games are selected to be domains where there is an obvious simple thing to do, and so prog
Confusions re: Higher-Level Game Theory

In a game with any finite number of players, and any finite number of actions per player.

Let O be the set of possible outcomes.

Player i implements policy π_i. For each outcome in O, each player searches for proofs (in PA) that the outcome is impossible. It then takes the set of outcomes it has proved impossible, and maps that set to an action.

There is always a unique action that is chosen. What's more, given oracles for

I.e. the set of actions you might take if you can pr... (read more)

Vignettes Workshop (AI Impacts)


The next task to fall to narrow AI is adversarial attacks against humans. Virulent memes and convincing ideologies become easy to generate on demand. A small number of people might see what is happening, and try to shield themselves off from dangerous ideas. They might even develop tools that auto-filter web content. Most of society becomes increasingly ideologized, with more decisions being made on political rather than practical grounds. Educational and research institutions become full of ideologues crowding out real research. There are some w... (read more)

Reward Is Not Enough

I would be potentially concerned that this is a trick that evolution can use, but human AI designers can't use safely. 

In particular, I think this is the sort of trick that usually produces fairly good results when you have a fixed environment and can optimize the parameters and settings for that environment. Evolution can try millions of birds, tweaking the strengths of desires, to get something that kind of works. When the environment will be changing rapidly, when the relative capabilities of cognitive modules are highly uncertain, and when self mod... (read more)

2Steve Byrnes1y
Sure. I wrote "similar to (or even isomorphic to)". We get to design it how we want. We can allow the planning submodule direct easy access to the workings of the choosing-words submodule if we want, or we can put strong barriers such that the planning submodule needs to engage in a complicated hacking project in order to learn what the choosing-words submodule is doing. I agree that the latter is probably a better setup. Sure, that's possible. My "negative" response is: There's no royal road to safe AGI, at least not that anyone knows of so far. In particular, if we talk specifically about "subagent"-type situations where there are mutually-contradictory goals within the AGI, I think that this is simply a situation we have to deal with, whether we like it or not. And if there's no way to safely deal with that kind of situation, then I think we're doomed. Why do I think that? For one thing, as I wrote in the text, it's arbitrary where we draw the line between "the AGI" and "other algorithms interacting with and trying to influence the AGI". If we draw a box around the AGI to also include things like gradient updates, or online feedback from humans, then we're definitely in that situation, because these are subsystems that are manipulating the AGI and don't share the AGI's (current) goals. For another thing: it's a complicated world and the AGI is not omniscient. If you think about logical induction [], the upshot is that when venturing into a complicated domain with unknown unknowns, you shouldn't expect nice well-formed self-consistent hypotheses attached to probabilities, you should expect a pile of partial patterns (i.e. hypotheses which make predictions about some things but are agnostic about others), supported by limited evidence. Then you can get situations where those partial patterns push in different directions, and "bid against each other". Now just apply exactly that same reasoning to "having desires about
Avoiding the instrumental policy by hiding information about humans

There are various ideas along the lines of "however much you tell the AI X it just forgets it".

I think that would be the direction to look in if you have a design that's safe as long as it doesn't know X.

4Paul Christiano1y
Unpacking "mutual information," it seems like these designs basically take the form of an adversarial game: * The model computes some intermediate states. * An adversary tries to extract facts about the "unknowable" X. * The model is trained so that the adversary can't succeed. But this rests on the adversary not already knowing about X (otherwise we couldn't measure whether the adversary succeeds). In the case of mutual information, this is achieved formally by having a random variable that the adversary does not observe directly. If we are talking about "what humans are like" then we can't take the naive approach of mutual information (since we can't deploy the entire training process many times in different worlds where humans are different). So what do we do instead? The obvious approach is to just train the adversary to answer questions about humans, but then we somehow need to prevent the adversary from simply learning the facts themselves. If instead we don't give the adversary much time to learn, or much compute to work with, then we need to worry about cases where the model learns about X but is able to easily obscure that information from the adversary. (Mostly I'm dissuaded from this approach by other considerations, but I am still interested in whether we could make anything along these lines actually work.)
A naive alignment strategy and optimism about generalization

There may be predictable errors in the training data, such that instrumental policy actually gets a lower loss than answering honestly (because it responds strategically to errors).

If you are answering questions as text, there is a lot of choice in wording. There are many strings of text that are a correct answer, and the AI has to pick the one the human would use. In order to predict how a human would word an answer, you need a fairly good understanding of how they think (I think). 

4Paul Christiano1y
I agree you have to do something clever to make the intended policy plausibly optimal. The first part of my proposal in section 3 here was to avoid using "imitate humans," and to instead learn a function "Answer A is unambiguously worse than answer B." Then we update against policies only when they give unambiguously worse answers. (I think this still has a lot of problems; it's not obvious to me whether the problem is soluble.)
Speculations against GPT-n writing alignment papers

Maybe you did. I find it hard to distinguish inventing and half remembering ideas. 

If the training procedure either

  1. reliably produces mesaoptimizers with about the same values, or
  2. reliably produces mesaoptimizers that can acausally cooperate, or
  3. the rest of the procedure allows one mesaoptimizer to take control of the whole output,

then using different copies of GPT-n trained from different seeds doesn't help.

If you just convert 1% of the English into network yourself, then all it needs to use is some error correction. Even without that, neural net struc... (read more)

Optimization, speculations on the X and only X problem.

I don't think that learning is moving around in codespace. In the simplest case, the AI is like any other non-self-modifying program. The code stays fixed as the programmers wrote it. The variables update. The AI doesn't start from null. The programmer starts from a blank text file and adds code. Then they run the code. The AI can start with sophisticated behaviour the moment it's turned on.

So are we talking about a program that could change from an X er to a Y er with a small change in the code written, or with a small amount of extra observation of the world?

[Event] Weekly Alignment Research Coffee Time

There seems to be some technical problem with the link. It gives me an "Our apologies, your invite link has now expired (actually several hours ago, but we hate to rush people).

We hope you had a really great time! :)" message. Edit: this was as of a few minutes after the stated start time. It worked last week.

1Adam Shimi1y
Hey, it seems like others could use the link, so I'm not sure what went wrong. If you have the same problem tomorrow, just send me a PM.
Optimization, speculations on the X and only X problem.

My picture of an X and only X er is that the actual program you run should optimize only for X. I wasn't considering similarity in code space at all. 

Getting the lexicographically first formal ZFC proof of, say, the Collatz conjecture should be safe. Getting a random proof sampled from the set of all proofs < 1 terabyte long should be safe. But I think that there exist proofs that wouldn't be safe. There might be a valid proof of the conjecture that had the code for a paperclip maximizer encoded into the proof, and that exploited some flaw in compute... (read more)

Well, a main reason we'd care about codespace distance is that it tells us something about how the agent will change as it learns (i.e. moves around in codespace). (This is involving time, since the agent is changing, contra your picture.) So a key (quasi)metric on codespace would be: "how much" learning does it take to get from here to there. The if True: x() else: y() program is an unnatural point in codespace in this metric: you'd have to have traversed both the distances from null to x() and from null to y(), and it's weird to have traversed a distance and make no use of your position. A framing of the only-X problem is that traversing from null to a program that's an only-Xer according to your definition might also constitute traversing almost all of the way from null to a program that's an only-Yer, where Y is "very different" from X.
Rogue AGI Embodies Valuable Intellectual Property

On its face, this story contains some shaky arguments. In particular, Alpha is initially going to have 100x-1,000,000x more resources than Alice. Even if Alice grows its resources faster, the alignment tax would have to be very large for Alice to end up with control of a substantial fraction of the world’s resources.

This makes the hidden assumption that "resources" is a good abstraction in this scenario. 

It is being assumed that the amount of resources an agent "has" is a well-defined quantity. It assumes agents can only grow their resources slowly by ... (read more)

+1. Another way of putting it: This allegation of shaky arguments is itself super shaky, because it assumes that overcoming a 100x - 1,000,000x gap in "resources" implies a "very large" alignment tax. This just seems like a weird abstraction/framing to me that requires justification.

I wrote this Conquistadors post in part to argue against this abstraction/framing. These three conquistadors are something like a natural experiment in "how much conquering can the few do against the many, if they have various advantages?" (If I just selected a lone conqueror, ... (read more)

Agency in Conway’s Game of Life

Random Notes:

Firstly, why is the rest of the starting state random? In a universe where info can't be destroyed, like this one, random=max entropy. AI is only possible in this universe because the starting state is low entropy.

Secondly, reaching an arbitrary state can be impossible for reasons like conservation of mass, energy, momentum and charge. Any state close to an arbitrary state might be unreachable due to these conservation laws. E.g. a state containing lots of negative electric charges, and no positive charges, is unreachable in our universe.

Well, q... (read more)

Challenge: know everything that the best go bot knows about go

I think that it isn't clear what constitutes "fully understanding" an algorithm. 

Say you pick something fairly simple, like a floating point square root algorithm. What does it take to fully understand that?

You have to know what a square root is. Do you have to understand the maths behind Newton-Raphson iteration if the algorithm uses that? All the mathematical derivations, or just taking it as a mathematical fact that it works? Do you have to understand all the proofs about convergence rates? Or can you just go "yeah, 5 iterations seems to be eno... (read more)
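To make the square root example concrete, here is a minimal Newton-Raphson sketch. The fixed iteration count of 5 is exactly the kind of empirically-chosen detail the comment gestures at, and the whole thing is illustrative rather than any particular library's implementation:

```python
def sqrt_newton(x: float, iterations: int = 5) -> float:
    """Approximate sqrt(x) by Newton-Raphson iteration on f(g) = g*g - x."""
    if x < 0:
        raise ValueError("square root of a negative number")
    if x == 0:
        return 0.0
    guess = x if x >= 1 else 1.0  # crude starting point
    for _ in range(iterations):
        guess = 0.5 * (guess + x / guess)  # Newton update for g*g - x = 0
    return guess
```

"Fully understanding" even this much already branches into questions about the derivation of the update rule, why quadratic convergence makes 5 iterations enough for moderate inputs, and how floating point rounding interacts with the loop.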

That seems right. I think there's reason to believe that SGD doesn't do exactly this (nets that memorize random data have different learning curves than normal nets iirc?), and better reason to think it's possible to train a top go bot that doesn't do this. Yes, but luckily you don't have to do this for all algorithms, just the best go bot. Also as mentioned, I think you probably get to use a computer program for help, as long as you've written that computer program.
AMA: Paul Christiano, alignment researcher

"These technologies are deployed sufficiently narrowly that they do not meaningfully accelerate GWP growth." I think this is fairly hard for me to imagine (since their lead would need to be very large to outcompete another country that did deploy the technology to broadly accelerate growth), perhaps 5%?

I think there is a reasonable way it could happen even without an enormous lead. You just need one of:

  1. It's very hard to capture a significant fraction of the gains from the tech.
  2. Tech progress scales very poorly in money.

For example, suppose it is obviou... (read more)

Three reasons to expect long AI timelines

I don't think technological deployment is likely to take that long for AIs. With a physical device like a car or fridge, it takes time for people to set up the factories and manufacture the devices. An AI can be sent across the internet in moments. I don't know how long it takes Google to go from, say, an algorithm that detects streets in satellite images to the results showing up in Google Maps, but it's not anything like the decades it took those physical techs to roll out.

The slow roll-out scenario looks like this: AGI is developed using a technique that fu... (read more)

How do we prepare for final crunch time?

I don't actually think "It is really hard to know what sorts of AI alignment work are good this far out from transformative AI." is very helpful. 

It is currently fairly hard to tell what is good alignment work. A week out from TAI, either good alignment work will be easier to recognise, because of alignment progress not strongly correlated with capabilities, or good alignment research will be just as hard to recognise. (More likely the latter.) I can't think of any safety research that can be done on GPT-3 that can't be done on GPT-1.

In my picture, res... (read more)

My AGI Threat Model: Misaligned Model-Based RL Agent

, it seems to me that under these assumptions there would probably be a series of increasingly-worse accidents spread out over some number of years, culminating in irreversible catastrophe, with humanity unable to coordinate to avoid that outcome—due to the coordination challenges in Assumptions 2-4.

I'm not seeing quite what the bad-but-not-existential catastrophes would look like. I also think the AI has an incentive not to do this. My world model (assuming slow takeoff) goes more like this:

AI created in lab. It's a fairly skilled programmer and hacker. Ab... (read more)

1Steve Byrnes1y
Hmm, I dunno, I haven't thought it through very carefully. But I guess an AGI might require a supercomputer of resources and maybe there are only so many hackable supercomputers of the right type, and the AI only knows one exploit and leaves traces of its hacking that computer security people can follow, and meanwhile self-improvement is hard and slow (for example, in the first version you need to train for two straight years, and in the second self-improved version you "only" need to re-train for 18 months). If the AI can run on a botnet then there are more options, but maybe it can't deal with latency / packet loss / etc., maybe it doesn't know a good exploit, maybe security researchers find and take down the botnet C&C infrastructure, etc. Obviously this wouldn't happen with a radically superhuman AGI but that's not what we're talking about. But from my perspective, this isn't a decision-relevant argument. Either we're doomed in my scenario or we're even more doomed in yours. We still need to do the same research in advance. Well, we can be concerned about non-corrigible systems that act deceptively (cf. "treacherous turn"). And systems that have close-but-not-quite-right goals such that they're trying to do the right thing in test environments, but their goals veer away from humans' in other environments, I guess.
HCH Speculation Post #2A

In the giant lookup table space, HCH must converge to a cycle, although that convergence can be really slow. I think you have convergence to a stationary distribution if each layer is trained on a random mix of several previous layers. Of course, you can still have oscillations in what is said within a policy fixed point.

HCH Speculation Post #2A

If you want to prove things about fixed points of HCH in an iterated function setting, consider it a function from policies to policies. Let M be the set of messages (say, ASCII strings under 10 kB). Given a giant lookup table T that maps M to M, we can create another giant lookup table: for each m in M, give a human in a box the string m, and unlimited query access to T. Record their output.

The fixed points of this are the same as the fixed points of HCH. "Human with query access to" is a function on the space of policies.
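The framing above can be sketched as a toy program, where a "policy" is a finite lookup table from messages to messages. `human_step` here is a stand-in for the human in the box — an invented toy function for illustration, obviously not a real human policy:

```python
def human_step(message: str, query) -> str:
    """Toy stand-in for 'human in a box with query access to T'.

    This fake human echoes the table's existing answer for the message if
    there is one, and otherwise produces a base answer.
    """
    sub = query(message)
    return sub if sub else "base:" + message


def hch_map(T: dict) -> dict:
    """One application of 'human with query access to T', as a lookup table."""
    return {m: human_step(m, lambda q: T.get(q, "")) for m in T}
```

With this toy human, iterating `hch_map` from the empty-answer table reaches a fixed point after two steps; the comment's point is that HCH's fixed points are exactly the fixed points of this policies-to-policies map, whatever the (real) human's behaviour is.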

1Charlie Steiner1y
Sure, but the interesting thing to me isn't fixed points in the input/output map, it's properties (i.e. attractors that are allowed to be large sets) that propagate from the answers seen by a human in response to their queries, into their output. Even if there's a fixed point, you have to further prove that this fixed point is consistent - that it's actually the answer to some askable question. I feel like this is sort of analogous to Hofstadter's q-sequence.
Comments on "The Singularity is Nowhere Near"

Tim Dettmers' whole approach seems to assume that there are no computational shortcuts, no tricks that programmers can use for speed where evolution brute-forced it. For example, maybe a part of the brain is doing a convolution by the straightforward brute-force algorithm, and programmers can use fast-Fourier-transform-based convolutions. Maybe some neurons are discrete enough for us to use single bits. Maybe we can analyse the dimensions of the system, find that some are strongly attractive, and so just work in that subspace.

Of course, all t... (read more)

Donald Hobson's Shortform

Yes. If you have an AI that has been given a small, easily completable task, like putting one block on top of another with a robot arm, it is probably just going to do your simple task. The idea is that you build a fairly secure box and give the AI a task it can fairly easily achieve in that box (with you having no intention of pressing the button so long as the AI seems to be acting normally). We want to make "just do your task" the best strategy. If the box is less secure than we thought, or various other things go wrong, the AI will just shut... (read more)

Donald Hobson's Shortform

Here is a potential solution to stop button type problems, how does this go wrong?

Taking into account uncertainty, the algorithm is:

Calculate the X-maximizing best action in a world where the stop button does nothing.

Calculate the X-maximizing best action in a world where the stop button works.

If they are the same, do that. Otherwise, shut down.
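A minimal sketch of this decision rule, with the two planners left as stand-in callables (hypothetical names; the actual X-maximizing planners under each world model are the hard part being abstracted away):

```python
def choose_action(best_action_button_inert, best_action_button_works):
    """Stop-button decision rule: act only if both world models agree.

    best_action_button_inert: callable returning the X-maximizing action
        in a world where the stop button does nothing.
    best_action_button_works: callable returning the X-maximizing action
        in a world where the stop button works.
    """
    a_inert = best_action_button_inert()
    a_works = best_action_button_works()
    if a_inert == a_works:
        return a_inert       # the two worlds agree: take that action
    return "SHUTDOWN"        # they disagree: shut down instead
```

The failure mode raised in the reply below shows up directly: if the button-works planner outputs "take preemptive action against the button" while the button-inert planner outputs the plain task action, the rule triggers shutdown at nearly every step.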

It seems like the button-works action will usually be some variety of "take preemptive action to ensure the button won't be pressed" and so the AI will have a high chance to shut down at each decision step.
Donald Hobson's Shortform

Rough stop button problem ideas.

You want an AI that believes its actions can't affect the button. You could use causal counterfactuals: an imaginary button that presses itself at random. You can scale the likelihood of worlds up and down, to ensure the button is equally likely to be pressed in each world. (Weird behaviour, not recommended.) You can put the AI in the logical counterfactual of "my actions don't influence the chance the button is pressed", if you can figure out logical counterfactuals.

Or you can get the AI to simulate what it would do if it were an X maximizer. If it thinks the button won't be pressed, it does that; otherwise it does nothing. (Not clear how to generalize to uncertain AI.)

1Donald Hobson1y
Here is a potential solution to stop button type problems, how does this go wrong? Taking into account uncertainty, the algorithm is. Calculate the X maximizing best action in a world where the stop button does nothing. Calculate the X maximizing best action in a world where the stop button works. If they are the same, do that. Otherwise shutdown.