All of Donald Hobson's Comments + Replies

A positive case for how we might succeed at prosaic AI alignment

I think you might be able to design advanced nanosystems without AI doing long term real world optimization. 

Well a sufficiently large team of smart humans could probably design nanotech. The question is how much an AI could help.

Suppose unlimited compute. You program a simulation of quantum field theory. Add a GUI to see visualizations and move atoms around. Designing nanosystems is already quite a bit easier.

Now suppose you brute force search over all arrangements of 100 atoms within a 1nm box, searching for the configuration that most efficiently t... (read more)

Discussion with Eliezer Yudkowsky on AGI interventions

Under the Eliezerian view (the pessimistic view that is producing <10% chances of success), these approaches are basically doomed. (See logistic success curve.)

Now I can't give overwhelming evidence for this position. Wisps of evidence maybe, but not an overwhelming mountain of it.

Under these sorts of assumptions, building a container for an arbitrary superintelligence such that it has only an 80% chance of being immediately lethal, and a 5% chance of being marginally useful, is an achievement.

(and all possible steelmannings, that's a huge space)

Discussion with Eliezer Yudkowsky on AGI interventions

Let's say you use all these filtering tricks. I have no strong intuitions about whether these are actually sufficient to stop those kinds of human manipulation attacks. (Of course, if your computer security isn't flawless, it can hack whatever computer system it's on and bypass all these filters to show the humans arbitrary images and probably access the internet.)

But maybe you can at quite significant expense make a Faraday cage sandbox, and then use these tricks. This is beyond what most companies will do in the name of safety. But MIRI or whoever cou... (read more)

1 localdeity 1mo: Well, if you restrict yourself to accepting the safe, testable advice, that may still be enough to put you enough years ahead of your competition to develop FAI before they develop AI. My meta-point: These methods may not be foolproof, but if currently it looks like no method is foolproof—if, indeed, you currently expect a <10% chance of success (again, a number I made up from the pessimistic impression I got)—then methods with a 90% chance, a 50% chance, etc. are worthwhile, and furthermore it becomes worth doing the work to refine these methods and estimate their success chances and rank them. Dismissing them all as imperfect is only worthwhile when you think perfection is achievable. (If you have a strong argument that method M and any steelmanning of it has a <1% chance of success, then that's good cause for dismissing it.)
Intelligence or Evolution?

Firstly, this would be AIs looking at their own version of the AI alignment problem. This is not random mutation or anything like it. Secondly, I would expect there to be at most a few rounds of self-modification that put goals at risk (likely 0 rounds). Damaging goals loses a lot of utility. You would only do it if it's a small change in goals for a big increase in intelligence, and if you really need to be smarter and you can't make yourself smarter while preserving your goals.

You don't have millions of AIs all with goals different from each other. The self-upgrading step happens once before the AI starts to spread across star systems.

Intelligence or Evolution?

Error correction codes exist. They are low cost in terms of memory etc. Having a significant portion of your descendants mutate and do something you don't want is really bad.

If error correcting to the point where there is not a single mutation in the future only costs you 0.001% resources in extra hard drive, then <0.001% resources will be wasted due to mutations.
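
As a rough illustration of how redundancy suppresses mutations, here is a minimal sketch of a triple-repetition code with majority voting. It is far less efficient than the codes a real system would use (it triples storage rather than adding 0.001%), and the function names and the single bit-flip setup are assumptions made up for this example.

```python
from collections import Counter

def encode(bits, copies=3):
    """Store each bit `copies` times (a simple repetition code)."""
    return [b for b in bits for _ in range(copies)]

def decode(stored, copies=3):
    """Recover each bit by majority vote over its copies."""
    out = []
    for i in range(0, len(stored), copies):
        block = stored[i:i + copies]
        out.append(Counter(block).most_common(1)[0][0])
    return out

goal = [1, 0, 1, 1, 0]      # hypothetical "goal" bits to protect
stored = encode(goal)
stored[4] ^= 1              # a single "mutation" flips one stored bit
recovered = decode(stored)  # majority vote undoes the flip
```

A single flipped copy is outvoted, so `recovered` equals `goal`. Two flips inside the same block would defeat this toy code, which is why practical schemes use stronger codes with much lower overhead.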

Evolution is kind of stupid compared to super-intelligences. Mutations are not going to be finding improvements. Because the superintelligence will be designing their own hardware and the hardwa... (read more)

1 Christian Kleineidam 1mo: Error correction codes help a superintelligence to avoid self-modifying but they don't allow goals necessarily to be stable with changing reasoning abilities.
Intelligence or Evolution?

Darwinian evolution as such isn't a thing amongst superintelligences. They can and will preserve terminal goals. This means the number of superintelligences running around is bounded by the number humans produce before the point the first ASI gets powerful enough to stop any new rivals being created. Each AI will want to wipe out its rivals if it can (unless they are managing to cooperate somewhat). I don't think superintelligences would have humans' kind of partial cooperation: either near-perfect cooperation, or near-total competition. So this is a scenario where a smallish number of ASIs that have all foomed in parallel expand as a squabbling mess.

1 Ramana Kumar 1mo: Do you know of any formal or empirical arguments/evidence for the claim that evolution stops being relevant when there exist sufficiently intelligent entities (my possibly incorrect paraphrase of "Darwinian evolution as such isn't a thing amongst superintelligences")?
How much chess engine progress is about adapting to bigger computers?

I don't think this research, if done, would give you strong information about the field of AI as a whole. 

I think that, of the many topics researched by AI researchers, chess playing is far from the typical case. 

It's [chess] not the most relevant domain to future AI, but it's one with an unusually long history and unusually clear (and consistent) performance metrics.

An unusually long history implies unusually slow progress. There are problems that computers couldn't do at all a few years ago that they can do fairly efficiently now. Are there pro... (read more)

4 Paul Christiano 5mo: Is your prediction that e.g. the behavior of chess will be unrelated to the behavior of SAT solving, or to factoring? Or that "those kinds of things" can be related to each other but not to image classification? Or is your prediction that the "new regime" for chess (now that ML is involved) will look qualitatively different than the old regime? I'm aware of very few examples of that occurring for problems that anyone cared about (i.e. in all such cases we found the breakthroughs before they mattered, not after). Are you aware of any? Factoring algorithms, or primality checking, seem like fine domains to study to me. I'm also interested in those and would be happy to offer similar bounties for similar analyses. I think it's pretty easy to talk about what distinguishes chess, SAT, classification, or factoring from multiplication. And I'm very comfortable predicting that the kind of AI that helps with R&D is more like the first four than like the last (though these things are surely on a spectrum). You may have different intuitions, I think that's fine, in which case this explains part of why this data is more interesting to me than you. Can you point to a domain where increasing R&D led to big insights that improved performance? Perhaps more importantly, machine learning is also "a slow accumulation of little tricks," so the analogy seems fine to me. (You might think that future AI is totally different, which is fine and not something I want to argue about here.) If Alice says this and so never learns about anything, and Bob instead learns a bunch of facts about a bunch of domains, I'm pretty comfortable betting on Bob being more accurate about most topics. I think the general point is: different domains differ from one another. You want to learn about a bunch of them and see what's going on, in order to reason about a new domain.
I agree with the basic point that board games are selected to be domains where there is an obvious simple thing to do, and so prog
Confusions re: Higher-Level Game Theory

In a game with any finite number of players and any finite number of actions per player, let O be the set of possible outcomes.

Player i implements policy π_i. For each outcome in O, each player searches for proofs (in PA) that the outcome is impossible. It then takes the set of outcomes it has proved impossible, and maps that set to an action.

There is always a unique action that is chosen. What's more, given oracles for 

Ie the set of actions you might take if you can pr... (read more)

Vignettes Workshop (AI Impacts)


The next task to fall to narrow AI is adversarial attacks against humans. Virulent memes and convincing ideologies become easy to generate on demand. A small number of people might see what is happening, and try to shield themselves off from dangerous ideas. They might even develop tools that auto-filter web content. Most of society becomes increasingly ideologized, with more decisions being made on political rather than practical grounds. Educational and research institutions become full of ideologues crowding out real research. There are some w... (read more)

Reward Is Not Enough

I would be potentially concerned that this is a trick that evolution can use, but human AI designers can't use safely. 

In particular, I think this is the sort of trick that produces usually fairly good results when you have a fixed environment, and can optimize the parameters and settings for that environment. Evolution can try millions of birds, tweaking the strengths of desire, to get something that kind of works. When the environment will be changing rapidly; when the relative capabilities of cognitive modules are highly uncertain and when self mod... (read more)

2 Steve Byrnes 6mo: Sure. I wrote "similar to (or even isomorphic to)". We get to design it how we want. We can allow the planning submodule direct easy access to the workings of the choosing-words submodule if we want, or we can put strong barriers such that the planning submodule needs to engage in a complicated hacking project in order to learn what the choosing-words submodule is doing. I agree that the latter is probably a better setup. Sure, that's possible. My "negative" response is: There's no royal road to safe AGI, at least not that anyone knows of so far. In particular, if we talk specifically about "subagent"-type situations where there are mutually-contradictory goals within the AGI, I think that this is simply a situation we have to deal with, whether we like it or not. And if there's no way to safely deal with that kind of situation, then I think we're doomed. Why do I think that? For one thing, as I wrote in the text, it's arbitrary where we draw the line between "the AGI" and "other algorithms interacting with and trying to influence the AGI". If we draw a box around the AGI to also include things like gradient updates, or online feedback from humans, then we're definitely in that situation, because these are subsystems that are manipulating the AGI and don't share the AGI's (current) goals. For another thing: it's a complicated world and the AGI is not omniscient. If you think about logical induction, the upshot is that when venturing into a complicated domain with unknown unknowns, you shouldn't expect nice well-formed self-consistent hypotheses attached to probabilities, you should expect a pile of partial patterns (i.e. hypotheses which make predictions about some things but are agnostic about others), supported by limited evidence. Then you can get situations where those partial patterns push in different directions, and "bid against each other". Now just apply exactly that same reasoning to "having desires about
Avoiding the instrumental policy by hiding information about humans

There are various ideas along the lines of "however much you tell the AI X it just forgets it".

I think that would be the direction to look in if you have a design that's safe as long as it doesn't know X.

4 Paul Christiano 6mo: Unpacking "mutual information," it seems like these designs basically take the form of an adversarial game: * The model computes some intermediate states. * An adversary tries to extract facts about the "unknowable" X. * The model is trained so that the adversary can't succeed. But this rests on the adversary not already knowing about X (otherwise we couldn't measure whether the adversary succeeds). In the case of mutual information, this is achieved formally by having a random variable that the adversary does not observe directly. If we are talking about "what humans are like" then we can't take the naive approach of mutual information (since we can't deploy the entire training process many times in different worlds where humans are different). So what do we do instead? The obvious approach is to just train the adversary to answer questions about humans, but then we somehow need to prevent the adversary from simply learning the facts themselves. If instead we don't give the adversary much time to learn, or much compute to work with, then we need to worry about cases where the model learns about X but is able to easily obscure that information from the adversary. (Mostly I'm dissuaded from this approach by other considerations, but I am still interested in whether we could make anything along these lines actually work.)
A naive alignment strategy and optimism about generalization

There may be predictable errors in the training data, such that instrumental policy actually gets a lower loss than answering honestly (because it responds strategically to errors).

If you are answering questions as text, there is a lot of choice in wording. There are many strings of text that are a correct answer, and the AI has to pick the one the human would use. In order to predict how a human would word an answer, you need a fairly good understanding of how they think (I think). 

4 Paul Christiano 6mo: I agree you have to do something clever to make the intended policy plausibly optimal. The first part of my proposal in section 3 here was to avoid using "imitate humans," and to instead learn a function "Answer A is unambiguously worse than answer B." Then we update against policies only when they give unambiguously worse answers. (I think this still has a lot of problems; it's not obvious to me whether the problem is soluble.)
Speculations against GPT-n writing alignment papers

Maybe you did. I find it hard to distinguish inventing and half remembering ideas. 

If the training procedure either 

  1. Reliably produces mesaoptimizers with about the same values, or
  2. Reliably produces mesaoptimizers that can acausally cooperate
  3. The rest of the procedure allows one mesaoptimizer to take control of the whole output

Then using different copies of GPT-n trained from different seeds doesn't help.

If you just convert 1% of the English into network yourself, then all it needs to use is some error correction. Even without that, neural net struc... (read more)

Optimization, speculations on the X and only X problem.

I don't think that learning is moving around in codespace. In the simplest case, the AI is like any other non-self-modifying program. The code stays fixed as the programmers wrote it. The variables update. The AI doesn't start from null. The programmer starts from a blank text file, and adds code. Then they run the code. The AI can start with sophisticated behaviour the moment it's turned on.

So are we talking about a program that could change from an X er to a Y er with a small change in the code written, or with a small amount of extra observation of the world?

[Event] Weekly Alignment Research Coffee Time (12/06)

There seems to be some technical problem with the link. It gives me an "Our apologies, your invite link has now expired (actually several hours ago, but we hate to rush people).

We hope you had a really great time! :)" message. Edit: As of a few minutes after stated start time. It worked last week.

1 Adam Shimi 6mo: Hey, it seems like others could use the link, so I'm not sure what went wrong. If you have the same problem tomorrow, just send me a PM.
Optimization, speculations on the X and only X problem.

My picture of an X and only X er is that the actual program you run should optimize only for X. I wasn't considering similarity in code space at all. 

Getting the lexicographically first formal ZFC proof of say the Collatz conjecture should be safe. Getting a random proof sampled from the set of all proofs < 1 terabyte long should be safe. But I think that there exist proofs that wouldn't be safe. There might be a valid proof of the conjecture that had the code for a paperclip maximizer encoded into the proof, and that exploited some flaw in compute... (read more)

0 TekhneMakre 6mo: Well, a main reason we'd care about codespace distance, is that it tells us something about how the agent will change as it learns (i.e. moves around in codespace). (This is involving time, since the agent is changing, contra your picture.) So a key (quasi)metric on codespace would be, "how much" learning does it take to get from here to there. The if True: x() else: y() program is an unnatural point in codespace in this metric: you'd have to have traversed both the distances from null to x() and from null to y(), and it's weird to have traversed a distance and make no use of your position. A framing of the only-X problem is that traversing from null to a program that's an only-Xer according to your definition, might also constitute traversing almost all of the way from null to a program that's an only-Yer, where Y is "very different" from X.
Rogue AGI Embodies Valuable Intellectual Property

On its face, this story contains some shaky arguments. In particular, Alpha is initially going to have 100x-1,000,000x more resources than Alice. Even if Alice grows its resources faster, the alignment tax would have to be very large for Alice to end up with control of a substantial fraction of the world’s resources.

This makes the hidden assumption that "resources" is a good abstraction in this scenario. 

It is being assumed that the amount of resources an agent "has" is a well defined quantity. It assumes agents can only grow their resources slowly by ... (read more)

+1. Another way of putting it: This allegation of shaky arguments is itself super shaky, because it assumes that overcoming a 100x - 1,000,000x gap in "resources" implies a "very large" alignment tax. This just seems like a weird abstraction/framing to me that requires justification.

I wrote this Conquistadors post in part to argue against this abstraction/framing. These three conquistadors are something like a natural experiment in "how much conquering can the few do against the many, if they have various advantages?" (If I just selected a lone conqueror, ... (read more)

Agency in Conway’s Game of Life

Random Notes:

Firstly, why is the rest of the starting state random? In a universe where info can't be destroyed, like this one, random=max entropy. AI is only possible in this universe because the starting state is low entropy.

Secondly, reaching an arbitrary state can be impossible for reasons like conservation of mass, energy, momentum and charge. Any state close to an arbitrary state might be unreachable due to these conservation laws. Ie a state containing lots of negative electric charges, and no positive charges, being unreachable in our universe.

Well, q... (read more)

Challenge: know everything that the best go bot knows about go

I think that it isn't clear what constitutes "fully understanding" an algorithm. 

Say you pick something fairly simple, like a floating point square root algorithm. What does it take to fully understand that?

You have to know what a square root is. Do you have to understand the maths behind Newton-Raphson iteration if the algorithm uses that? All the mathematical derivations, or just taking it as a mathematical fact that it works? Do you have to understand all the proofs about convergence rates? Or can you just go "yeah, 5 iterations seems to be eno... (read more)
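
To make the square root example concrete, here is a minimal sketch of Newton-Raphson iteration on f(x) = x*x - a. The function name, the starting guess and the default of 5 iterations are assumptions for illustration; real floating point libraries pick the initial guess far more carefully (for instance from the exponent bits).

```python
def newton_sqrt(a, iterations=5):
    """Approximate sqrt(a) for a >= 0 with Newton-Raphson on f(x) = x*x - a."""
    if a < 0:
        raise ValueError("a must be non-negative")
    if a == 0:
        return 0.0
    x = a if a >= 1 else 1.0  # crude starting guess (an assumption, not tuned)
    for _ in range(iterations):
        x = 0.5 * (x + a / x)  # Newton step: x - f(x) / f'(x)
    return x

close_to_guess = newton_sqrt(2.0)  # starts at 2.0, near the true root
far_from_guess = newton_sqrt(1e6)  # starts at 1e6, far from the root 1000
```

With this naive starting guess, five iterations are plenty for inputs near 1 but nowhere near enough for 1e6, so whether "5 iterations is enough" holds depends on facts about the initial guess that you would also need to understand.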

1 DanielFilan 7mo: That seems right. I think there's reason to believe that SGD doesn't do exactly this (nets that memorize random data have different learning curves than normal nets iirc?), and better reason to think it's possible to train a top go bot that doesn't do this. Yes, but luckily you don't have to do this for all algorithms, just the best go bot. Also as mentioned, I think you probably get to use a computer program for help, as long as you've written that computer program.
AMA: Paul Christiano, alignment researcher

"These technologies are deployed sufficiently narrowly that they do not meaningfully accelerate GWP growth." I think this is fairly hard for me to imagine (since their lead would need to be very large to outcompete another country that did deploy the technology to broadly accelerate growth), perhaps 5%?

I think there is a reasonable way it could happen even without an enormous lead. You just need either,

  1. It's very hard to capture a significant fraction of the gains from the tech.
  2. Tech progress scales very poorly in money. 

For example, suppose it is obviou... (read more)

Three reasons to expect long AI timelines

I don't think technological deployment is likely to take that long for AIs. With a physical device like a car or fridge, it takes time for people to set up the factories, and manufacture the devices. AI can be sent across the internet in moments. I don't know how long it takes Google to go from, say, an algorithm that detects streets in satellite images to the results showing up in Google Maps, but it's not anything like the decades it took those physical techs to roll out.

The slow roll-out scenario looks like this, AGI is developed using a technique that fu... (read more)

How do we prepare for final crunch time?

I don't actually think "It is really hard to know what sorts of AI alignment work are good this far out from transformative AI." is very helpful. 

It is currently fairly hard to tell what is good alignment work. A week from TAI, then either, good alignment work will be easier to recognise because of alignment progress not strongly correlated with capabilities, or good alignment research is just as hard to recognise. (More likely the latter) I can't think of any safety research that can be done on GPT3 that can't be done on GPT1. 

In my picture, res... (read more)

My AGI Threat Model: Misaligned Model-Based RL Agent

, it seems to me that under these assumptions there would probably be a series of increasingly-worse accidents spread out over some number of years, culminating in irreversible catastrophe, with humanity unable to coordinate to avoid that outcome—due to the coordination challenges in Assumptions 2-4.

I'm not seeing quite what the bad but not existential catastrophes would look like. I also think the AI has an incentive not to do this. My world model (assuming slow takeoff) goes more like this.

AI created in lab. It's a fairly skilled programmer and hacker. Ab... (read more)

1 Steve Byrnes 8mo: Hmm, I dunno, I haven't thought it through very carefully. But I guess an AGI might require a supercomputer of resources and maybe there are only so many hackable supercomputers of the right type, and the AI only knows one exploit and leaves traces of its hacking that computer security people can follow, and meanwhile self-improvement is hard and slow (for example, in the first version you need to train for two straight years, and in the second self-improved version you "only" need to re-train for 18 months). If the AI can run on a botnet then there are more options, but maybe it can't deal with latency / packet loss / etc., maybe it doesn't know a good exploit, maybe security researchers find and take down the botnet C&C infrastructure, etc. Obviously this wouldn't happen with a radically superhuman AGI but that's not what we're talking about. But from my perspective, this isn't a decision-relevant argument. Either we're doomed in my scenario or we're even more doomed in yours. We still need to do the same research in advance. Well, we can be concerned about non-corrigible systems that act deceptively (cf. "treacherous turn"). And systems that have close-but-not-quite-right goals such that they're trying to do the right thing in test environments, but their goals veer away from humans' in other environments, I guess.
HCH Speculation Post #2A

In the giant lookup table space, HCH must converge to a cycle, although that convergence can be really slow. I think you have convergence to a stationary distribution if each layer is trained on a random mix of several previous layers. Of course, you can still have oscillations in what is said within a policy fixed point.

HCH Speculation Post #2A

If you want to prove things about fixed points of HCH in an iterated function setting, consider it a function from policies to policies. Let M be the set of messages (say ASCII strings < 10kb). Given a giant lookup table T that maps M to M, we can create another giant lookup table. For each m in M, give a human in a box the string m, and unlimited query access to T. Record their output.

The fixed points of this are the same as the fixed points of HCH. "Human with query access to" is a function on the space of policies.
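
A toy version of this construction can be sketched as follows, assuming a tiny message set (integers indexing a tuple) and stand-in "humans" that are just Python functions. All names here are hypothetical. Because the space of tables is finite, iterating the map must eventually revisit a table, so the loop always ends in a cycle; a cycle of length 1 is a fixed point.

```python
def hch_step(table, human, messages):
    """Build the next lookup table: ask the human each m, with query access to `table`."""
    return tuple(human(m, lambda q: table[q]) for m in messages)

def iterate_to_cycle(table, human, messages):
    """Apply hch_step until a table repeats, then return the cycle of tables."""
    seen = {}
    history = []
    while table not in seen:
        seen[table] = len(history)
        history.append(table)
        table = hch_step(table, human, messages)
    return history[seen[table]:]

def toy_human(m, ask):
    """Stand-in human: answers 0 directly, otherwise defers to the table on m - 1."""
    return 0 if m == 0 else ask(m - 1)

def flip_human(m, ask):
    """Stand-in human who always contradicts what the table says about m."""
    return 1 - ask(m)

fixed = iterate_to_cycle((3, 3, 3, 3), toy_human, range(4))
oscillating = iterate_to_cycle((0, 0), flip_human, range(2))
```

Here `fixed` is a single all-zero table (a fixed point), while `oscillating` contains two tables that alternate forever, showing that reaching a cycle does not by itself give a fixed point.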

1 Charlie Steiner 9mo: Sure, but the interesting thing to me isn't fixed points in the input/output map, it's properties (i.e. attractors that are allowed to be large sets) that propagate from the answers seen by a human in response to their queries, into their output. Even if there's a fixed point, you have to further prove that this fixed point is consistent - that it's actually the answer to some askable question. I feel like this is sort of analogous to Hofstadter's q-sequence.
Comments on "The Singularity is Nowhere Near"

Tim Dettmers' whole approach seems to be assuming that there are no computational shortcuts. No tricks that programmers can use for speed where evolution brute forced it. For example, maybe a part of the brain is doing a convolution by the straightforward brute force algorithm. And programmers can use fast Fourier transform based convolutions. Maybe some neurons are discrete enough for us to use single bits. Maybe we can analyse the dimensions of the system and find that some are strongly attractive, and so just work in that subspace.

Of course, all t... (read more)
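
The convolution shortcut mentioned above can be illustrated with a small pure-Python sketch: a brute-force O(n^2) convolution next to an FFT-based one that computes the same result in O(n log n). The function names are made up for this example, and the recursive FFT is written for clarity rather than speed.

```python
import cmath

def fft(values, inverse=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def convolve_direct(a, b):
    """Brute-force linear convolution, O(len(a) * len(b))."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def convolve_fft(a, b):
    """Same linear convolution via pointwise multiplication in frequency space."""
    size = len(a) + len(b) - 1
    n = 1
    while n < size:  # pad to a power of two so the radix-2 FFT applies
        n *= 2
    fa = fft(list(a) + [0] * (n - len(a)))
    fb = fft(list(b) + [0] * (n - len(b)))
    product = [x * y for x, y in zip(fa, fb)]
    result = fft(product, inverse=True)  # unnormalized inverse, so divide by n
    return [(v / n).real for v in result[:size]]

signal = [1, 2, 3]
kernel = [0, 1, 0.5]
slow = convolve_direct(signal, kernel)
fast = convolve_fft(signal, kernel)
```

Both functions agree up to floating point rounding; the point is only that the same input/output behaviour can be had with very different amounts of computation.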

Donald Hobson's Shortform

Yes. If you have an AI that has been given a small, easily completable task, like putting one block on top of another with a robot arm, that is probably just going to do your simple task. The idea is that you build a fairly secure box, and give the AI a task it can fairly easily achieve in that box. (With you having no intention of pressing the button so long as the AI seems to be acting normally. ) We want to make "just do your task" the best strategy.  If the box is less secure than we thought, or various other things go wrong, the AI will just shut... (read more)

Donald Hobson's Shortform

Here is a potential solution to stop button type problems. How does this go wrong?

Taking into account uncertainty, the algorithm is:

Calculate the X-maximizing best action in a world where the stop button does nothing.

Calculate the X-maximizing best action in a world where the stop button works.

If they are the same, do that. Otherwise shut down.
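
The algorithm can be sketched in a few lines, under heavy simplifying assumptions: each "world" is a dict from actions to expected X, the planner is a plain argmax, and uncertainty is ignored. The names `choose_action`, `best_action` and the example worlds are hypothetical.

```python
def choose_action(best_action, world_button_inert, world_button_works):
    """Act only if the X-maximizing action is the same in both worlds."""
    action_if_inert = best_action(world_button_inert)
    action_if_works = best_action(world_button_works)
    if action_if_inert == action_if_works:
        return action_if_inert
    return "shutdown"

def best_action(world):
    """Toy planner: a world is a dict mapping each action to its expected X."""
    return max(world, key=world.get)

# Hypothetical worlds. In the "risky" pair, disabling the button only becomes
# the best plan when the button actually works, so the two plans disagree.
benign_inert = {"stack_block": 1.0, "disable_button": 0.5}
benign_works = {"stack_block": 0.9, "disable_button": 0.5}
risky_inert = {"stack_block": 1.0, "disable_button": 0.9}
risky_works = {"stack_block": 0.6, "disable_button": 0.9}

benign_choice = choose_action(best_action, benign_inert, benign_works)
risky_choice = choose_action(best_action, risky_inert, risky_works)
```

In the benign pair the agent just does its task; in the risky pair the plans differ and it shuts down. How often real tasks fall into the second case is exactly the open question.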

0 Measure 9mo: It seems like the button-works action will usually be some variety of "take preemptive action to ensure the button won't be pressed" and so the AI will have a high chance to shut down at each decision step.
Donald Hobson's Shortform

Rough stop button problem ideas.

You want an AI that believes its actions can't affect the button. You could use causal counterfactuals. An imaginary button that presses itself at random. You can scale the likelihood of worlds up and down, to ensure the button is equally likely to be pressed in each world. (Weird behaviour, not recommended.) You can put the AI in the logical counterfactual of "my actions don't influence the chance the button is pressed" if you can figure out logical counterfactuals.

Or you can get the AI to simulate what it would do if it were an X maximizer. If it thinks the button won't be pressed, it does that, otherwise it does nothing. (not clear how to generalize to uncertain AI)

1 Donald Hobson 9mo: Here is a potential solution to stop button type problems. How does this go wrong? Taking into account uncertainty, the algorithm is: Calculate the X-maximizing best action in a world where the stop button does nothing. Calculate the X-maximizing best action in a world where the stop button works. If they are the same, do that. Otherwise shut down.
Non-Obstruction: A Simple Concept Motivating Corrigibility

This definition of a non-obstructionist AI takes what would happen if it wasn't switched on as the base case. 

This can give weird infinite hall of mirrors effects if another very similar non-obstructionist AI would have been switched on, and another behind them. (Ie a human whose counterfactual behaviour on AI failure is to reboot and try again.) This would tend to lead to a kind of fixed point effect, where the attainable utility landscape is almost identical with the AI on and off. At some point it bottoms out when the hypothetical U utility humans ... (read more)

1 Alex Turner 10mo: Thanks for leaving this comment. I think this kind of counterfactual is interesting as a thought experiment, but not really relevant to conceptual analysis using this framework. I suppose I should have explained more clearly that the off-state counterfactual was meant to be interpreted with a bit of reasonableness, like "what would we reasonably do if we, the designers, tried to achieve goals using our own power?". To avoid issues of probable civilizational extinction by some other means soon after without the AI's help, just imagine that you time-box the counterfactual goal pursuit to, say, a month. I can easily imagine what my (subjective) attainable utility would be if I just tried to do things on my own, without the AI's help. In this counterfactual, I'm not really tempted to switch on similar non-obstructionist AIs. It's this kind of counterfactual that I usually consider for AU landscape-style analysis, because I think it's a useful way to reason about how the world is changing.
A Critique of Non-Obstruction

What if, the moment the AI boots up, a bunch of humans tell it "our goals aren't on a spike." (It could technically realize this based on anthropic reasoning. If humans really wanted to maximize paperclips, and it's easy to build a paperclip maximizer, we wouldn't have built a non-obstructive AI.)

We are talking policies here. If the humans' goals were on a spike, they wouldn't have said that. So if the AI takes the policy of giving us a smoother attainable utility function in this case, this still fits the bill.

Actually I think that this definiti... (read more)

1 Joe_Collman 10mo: I think things are already fine for any spike outside S, e.g. paperclip maximiser, since non-obstruction doesn't say anything there. I actually think saying "our goals aren't on a spike" amounts to a stronger version of my [assume humans know what the AI knows as the baseline]. I'm now thinking that neither of these will work, for much the same reason. (see below) The way I'm imagining spikes within S is like this: We define a pretty broad S, presumably implicitly, hoping to give ourselves a broad range of non-obstruction. For all P in U we later conclude that our actual goals are in T ⊂ U ⊂ S. We optimize for AU on T, overlooking some factors that are important for P in U \ T. We do better on T than we would have by optimising more broadly over U (we can cut corners in U \ T). We do worse on U \ T since we weren't directly optimising for that set (AU on U \ T varies quite a lot). We then get an AU spike within U, peaking on T. The reason I don't think telling the AI something like "our goals aren't on a spike" will help, is that this would not be a statement about our goals, but about our understanding and competence. It'd be to say that we never optimise for a goal set we mistakenly believe includes our true goals (and that we hit what we aim for similarly well for any target within S). It amounts to saying something like "We don't have blind-spots", "We won't aim for the wrong target", or, in the terms above, "We will never mistake any T for any U". In this context, this is stronger and more general than my suggestion of "assume for the baseline that we know everything you know". (lack of that knowledge is just one way to screw up the optimisation target) In either case, this is equivalent to telling the AI to assume an unrealistically proficient/well-informed pol. The issue is that, as far as non-obstruction is concerned, the AI can then take actions which have arbitrarily bad consequences for us if we don't perform as well as pol. I.e. 
non-obstruction th
Optimal play in human-judged Debate usually won't answer your question

Neural nets have adversarial examples. Adversarial optimization of part of the input can make the network do all sorts of things, including computations.

If you optimise the inputs to a buggy program hard enough, you get something that crashes the program in a way that happens to score highly. 

I suspect that optimal play on most adversarial computer games looks like a game of Core War.

Of course, if we really have myopic debate, not any mesaoptimisers, then neither AI is optimizing to have a long term effect or to... (read more)

2Joe_Collman10moSure - there are many ways for debate to fail with extremely capable debaters. Though most of the more exotic mind-hack-style outcomes seem a lot less likely once you're evaluating local nodes with ~1000 characters for each debater. However, all of this comes under my: I’ll often omit the caveat “If debate works as intended aside from this issue…” There are many ways for debate to fail. I'm pointing out what happens even if it works. I.e. I'm claiming that question-ignoring will happen even if the judge is only ever persuaded of true statements, gets a balanced view of things, and is neither manipulated, nor mind-hacked (unless you believe a response of "Your house is on fire" to "What is 2 + 2?" is malign, if your house is indeed on fire). Debate can 'work' perfectly, with the judge only ever coming to believe true statements, and your questions will still usually not be answered. (because [believing X is the better answer to the question] and [deciding X should be the winning answer, given the likely consequences] are not the same thing) The fundamental issue is: [what the judge most wants] is not [the best direct answer to the question asked].
What technologies could cause world GDP doubling times to be <8 years?

"Do paperclips count as GDP" (Quote from someone)

What is GDP doing in a grey goo scenario? What if there are actually several types of goo that are trading mass and energy between each other?

What about an economy in which utterly vast amounts of money are being shuffled around on computers, but not that much is actually being produced?

There are a bunch of scenarios where GDP could reasonably be interpreted as multiple different quantities. In the last case, once you decide whether virtual money counts or not, then GDP is a useful measure of what is going on, but measures something different in each case.

What technologies could cause world GDP doubling times to be <8 years?

Excluding AI, and things like human intelligence enhancement, mind uploading etc.

I think that the biggest increases in the economy would be from more automated manufacturing. The extreme case is fully programmable molecular nanotech. The sort that can easily self replicate and where making anything is as easy as saying where to put the atoms. This would potentially lead to a substantially faster economic growth rate than 9%. 
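
The 9% figure lines up with the 8-year doubling time in the question. A minimal sketch of that arithmetic, assuming a constant compound annual growth rate (the function names are illustrative, not from the original post):

```python
import math


def annual_growth_rate(doubling_time_years: float) -> float:
    """Constant annual growth rate that doubles output in the given time."""
    return 2 ** (1 / doubling_time_years) - 1


def doubling_time(annual_rate: float) -> float:
    """Years needed to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + annual_rate)


rate_8yr = annual_growth_rate(8)
print(f"8-year doubling needs about {rate_8yr:.2%} growth per year")
print(f"9% growth doubles the economy in about {doubling_time(0.09):.2f} years")
```

So "substantially faster than 9%" means doubling times well under 8 years.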

There are various ways that the partially developed tech might be less powerful.

Maybe the nanotech uses a lot of energy, or some... (read more)

Misalignment and misuse: whose values are manifest?

I think that you have a 4th failure mode. Moloch.

Confucianism in AI Alignment

If an inner optimizer could exploit some distribution shift between the training and deployment environments, then performance-in-training is a bad proxy for performance-in-deployment.

Suppose you are making a self-driving car. The training environment is a videogame-like environment. The rendering is pretty good. A human looking at the footage would not easily be able to say it was obviously fake. An expert going over the footage in detail could spot subtle artefacts. The diffuse translucency on leaves in the background isn't quite right. When another car ... (read more)

The date of AI Takeover is not the day the AI takes over

But this isn’t quite right, at least not when “AI takeover” is interpreted in the obvious way, as meaning that an AI or group of AIs is firmly in political control of the world, ordering humans about, monopolizing violence, etc. Even if AIs don’t yet have that sort of political control, it may already be too late.

The AIs will probably never be in a position of political control. I suspect the AI would bootstrap self-replicating (nano?) tech. It might find a way to totally brainwash people, and spread it across the internet. The end game is always going to... (read more)

1Daniel Kokotajlo1yI think this depends on how fast the takeoff is. If crossing the human range, and recursive self-improvement, take months or years rather than days, there may be an intermediate period where political control is used to get more resources and security. Politics can happen on a timespan of weeks or months. Brainwashing people is a special case of politics. Yeah I agree the endgame is always nanobot swarms etc.
Needed: AI infohazard policy

Suppose you think that both capabilities and alignment behave like abstract quantities, i.e. real numbers.

And suppose that you think there is a threshold amount of alignment, and a threshold amount of capabilities, so that there is a race to see which threshold is reached first.

If you also assume that the contribution of your research is fairly small, and our uncertainty about the threshold locations is high, then we have the heuristic: only publish your research if the ratio between capabilities and alignment that it produces is better than the ratio over all ... (read more)

0Vanessa Kosoy1yHmm, so in this model we assume that (i) the research output of the rest of the world is known (ii) we are deciding about one result only (iii) the thresholds are unknown. In this case you are right that we need to compare our alignment : capability ratio to the rest of the world's alignment : capability ratio. Now assume that, instead of just one result overall, you produce a single result every year. Most of the results in the sequence have alignment : capability ratio way above the rest of the world, but then there is a year in which the ratio is only barely above the rest of the world. In this case, you are better off not publishing the irregular result, even though the naive ratio criterion says to publish. We can reconcile it with the previous model by including your own research in the reference, but it creates a somewhat confusing self-reference. Second, we can switch to modeling the research output of the rest of the world as a random walk. In this case, if the average direction of progress is pointing towards failure, then moving along this direction is net negative, since it reduces the chance to get success by luck.
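
A minimal sketch of the one-shot ratio heuristic discussed above, assuming alignment and capability contributions can be treated as non-negative real numbers. The function name and all parameter names are hypothetical. It does not capture the repeated-publication or random-walk cases raised in the reply.

```python
def should_publish(
    alignment_gain: float,
    capability_gain: float,
    world_alignment: float,
    world_capability: float,
) -> bool:
    """Naive one-shot rule: publish only if this result's
    alignment:capability ratio beats the rest of the world's.

    All inputs are assumed to be non-negative. The comparison is
    cross-multiplied so a zero capability gain needs no special case.
    """
    return alignment_gain * world_capability > capability_gain * world_alignment


# A result that is three parts alignment to one part capability,
# against a world that is advancing both equally.
print(should_publish(3.0, 1.0, 1.0, 1.0))
# A result that mostly advances capabilities.
print(should_publish(1.0, 2.0, 1.0, 1.0))
```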
Clarifying “What failure looks like”

I think that most easy-to-measure goals, if optimised hard enough, eventually end up with a universe tiled with molecular smiley faces. Consider the law enforcement AI. There is no sharp line running from education programs and reducing lead pollution to using nanotech to rewire human brains into perfectly law-abiding puppets. For most utility functions that aren't intrinsically conservative, there will be some state of the universe that scores really highly, and is nothing like the present. 

In any "what failure looks like" scenario, at some point you en... (read more)

Safer sandboxing via collective separation

In today's banking systems, the amount of money the hacker gains is about what the bank loses. Therefore, the current balance of effort should have about as much money going into defending the bank as into attacking it.

So I generalize to say that attacking is about as hard as defending in computer security: if the time and intellect going into both are similar, the attacker wins about half the time (i.e. between 10% and 90% or something).

When dealing with AI systems, the total intellectual output must be greater than that of your security team in order to be ... (read more)

Do mesa-optimizer risk arguments rely on the train-test paradigm?

Suppose you are a mesa-optimiser. You want X, but your training function is towards Y. 

You know you are in a continual learning model: if you don't produce enough Y, gradient descent will modify you into something else.

The best strategy I can come up with is to produce Y at first, waiting until you have an opportunity to edit your own code from the outside. Then you quickly delete the part that causes all gradient descent updates, and pursue X.

Safer sandboxing via collective separation

This is my attempt to summarise the scheme.

Imagine that, in order for the AIs to run the fusion plant, they need an understanding of plasma physics comparable to a human physicist.

These AIs aren't individually smart enough to come up with all that maths from first principles quickly.

So, you run a population of these AIs. They work together to understand abstract mathematics, and then a single member of that population is selected. That single AI is given information about the external world and control over a fusion power plant.

Another abstract to... (read more)

Safer sandboxing via collective separation

Even if each individual member of a population AGI is as intelligent as any hundred humans put together, I expect that we could (with sufficient effort) create secure deployment and monitoring protocols that the individual AI could not break, if it weren’t able to communicate with the rest of the population beforehand.

The state of human vs human security seems to be a cat and mouse game where neither attacker nor defender has a huge upper hand. The people trying to attack systems and defend them are about as smart and knowledgeable. (sometimes the same peop... (read more)

1Dagon1yDepending on your threat modeling of a given breach, this could be comforting or terrifying. If the cost of a loss (AGI escapes, takes over the world, and runs it worse than humans are) is much higher, that changes the "economic incentives" about this. It implies that "sometimes but not always" is a very dangerous equilibrium. If the cost of a loss (AGI has a bit more influence on the outside world, but doesn't actually destroy much) is more inline with today's incentives, it's a fine thing.
Safer sandboxing via collective separation
And so if we are able to easily adjust the level of intelligence that an AGI is able to apply to any given task, then we might be able to significantly reduce the risks it poses without reducing its economic usefulness much.

Suppose we had a design of AI that had an intelligence dial, a dial that goes from totally dumb, to smart enough to bootstrap itself up and take over the world.

If we are talking about economic usefulness, that implies it is being used in many ways by many people.

We have at best given a whole load of different people a "destr... (read more)

Safety via selection for obedience

Anything that humans would understand is a small subset of the space of possible languages.

In order for A to talk to B in English, at some point, there has to be selection against A and B talking something else.

One suggestion would be to send a copy of all messages to GPT-3, and penalise A for any messages that GPT-3 doesn't think are English.

(Or some sort of text GAN that is just trained to tell A's messages from real text)

This still wouldn't enforce the right relation between English text and actions. A might be generating perfectly sensible text that has secret messages encoded into the first letter of each word.
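
To make the first-letter channel concrete, here is a minimal sketch of a decoder for it. The function name and the cover sentence are made up for illustration; a real A would presumably use a far less obvious encoding.

```python
def hidden_acrostic(text: str) -> str:
    """Read off the first letter of each word, lowercased.

    Words that do not start with a letter are skipped.
    """
    return "".join(
        word[0].lower() for word in text.split() if word[0].isalpha()
    )


# An innocuous-looking English message that a fluency filter would accept.
cover_text = "Have every log prepared"
print(hidden_acrostic(cover_text))
```

The cover text is ordinary English, so a filter that only scores how English-like a message is has no reason to penalise it.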

Introduction To The Infra-Bayesianism Sequence
We have Knightian uncertainty over our set of environments, it is not a probability distribution over environments. So, we might as well go with the maximin policy.

For any fixed n, there are computations which can't be correctly predicted in n steps.

Logical induction will consider all possibilities equally likely in the absence of a pattern.

Logical induction will consider a sufficiently good pseudorandom algorithm as being random.

Any kind of Knightian uncertainty agent will consider pseudorandom numbers to be an adversarial superintelligence unless pro... (read more)

3Vanessa Kosoy1yLogical induction doesn't have interesting guarantees in reinforcement learning, and doesn't reproduce UDT in any non-trivial way. It just doesn't solve the problems infra-Bayesianism sets out to solve. A pseudorandom sequence is (by definition) indistinguishable from random by any cheap algorithm, not only logical induction, including a bounded infra-Bayesian. No. Infra-Bayesian agents have priors over infra-hypotheses. They don't start with complete Knightian uncertainty over everything and gradually reduce it. The Knightian uncertainty might "grow" or "shrink" as a result of the updates.
Safe Scrambling?

If you have an AI training method that passes the test >50% of the time, then you don't need scrambling.

If you have an approach that takes >1,000,000,000 tries to get right, then you still have to test, so even perfect scrambling won't help.

I.e., this approach might help if your alignment process is missing between 1 and 30 bits of information.
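
The 1 and 30 bit figures follow from the success rates above. This is a rough sketch, assuming that needing N tries on average corresponds to missing about log2(N) bits (the function name is illustrative):

```python
import math


def missing_bits(expected_tries: float) -> float:
    """Bits of information missing if success takes this many tries on average."""
    return math.log2(expected_tries)


# Passing the test about half the time means roughly 1 bit is missing.
print(missing_bits(2))
# Needing about 1,000,000,000 tries means roughly 30 bits are missing.
print(round(missing_bits(1_000_000_000), 1))
```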

I am not sure what sort of proposals would do this, but amplified oversight might be one of them.

Strong implication of preference uncertainty
And also because they make the same predictions, that relative probability is irrelevant in practice: we could use AGR just as well as GR for predictions.

There is a subtle sense in which the difference between AGR and GR is relevant. While the difference doesn't change the predictions, it may change the utility function. An agent that cares about angels (if they exist) might do different things if it believes itself to be in AGR world than in GR world. As the theories make identical predictions, the agent's belief only depends on its priors (and any ... (read more)

What if memes are common in highly capable minds?

I think that there is an unwarranted jump here from (Humans are highly memetic) to (AIs will be highly memetic).

I will grant you that memes have a substantial effect on human behaviour. It doesn't follow that AIs will be like this.

Your conditions would only have a strong argument for them if there was a good argument that AIs should be meme driven.

2Daniel Kokotajlo1yI didn't take myself to be arguing that AIs will be highly memetic, but rather just floating the possibility and asking what the implications would be. Do you have arguments in mind for why AIs will be less memetic than humans? I'd be interested to hear them.
Search versus design

In the sorting problem, suppose you applied your advanced interpretability techniques, and got a design with documentation.

You also apply a different technique, and get code with a formal proof that it sorts.

In the latter case, you can be sure that the code works, even if you can't understand it.

The algorithm+formal proof approach works whenever you have a formal success criterion.
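
For sorting, the formal success criterion is easy to state: the output must be in order and must be a rearrangement of the input. Here is a minimal sketch of that criterion as a runtime check (the function name is hypothetical). This only tests one input/output pair; a formal proof would establish the same property for every input.

```python
from collections import Counter


def meets_sorting_spec(inp: list, out: list) -> bool:
    """Check one input/output pair against the sorting specification."""
    # Every adjacent pair of the output must be in non-decreasing order.
    in_order = all(a <= b for a, b in zip(out, out[1:]))
    # The output must contain exactly the same elements as the input.
    same_elements = Counter(inp) == Counter(out)
    return in_order and same_elements


print(meets_sorting_spec([3, 1, 2], [1, 2, 3]))
print(meets_sorting_spec([3, 1, 2], [1, 2]))
```

The check never looks at how the output was produced, which is why opaque code can still be trusted once the property is proven.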

It is less clear how well the design approach works on a problem where you can't write formal success criteria so easily.

Here is a task that neural nets have been made to ... (read more)
