All of Donald Hobson's Comments + Replies

How do we prepare for final crunch time?

I don't actually think "It is really hard to know what sorts of AI alignment work are good this far out from transformative AI" is very helpful.

It is currently fairly hard to tell what good alignment work is. A week out from TAI, either good alignment work will be easier to recognise, because of alignment progress that isn't strongly correlated with capabilities, or good alignment research will be just as hard to recognise (more likely the latter). I can't think of any safety research that can be done on GPT-3 that can't be done on GPT-1.

In my picture, res... (read more)

My AGI Threat Model: Misaligned Model-Based RL Agent

, it seems to me that under these assumptions there would probably be a series of increasingly-worse accidents spread out over some number of years, culminating in irreversible catastrophe, with humanity unable to coordinate to avoid that outcome—due to the coordination challenges in Assumptions 2-4.

I'm not quite seeing what the bad-but-not-existential catastrophes would look like. I also think the AI has an incentive not to do this. My world model (assuming slow takeoff) goes more like this:

AI created in lab. It's a fairly skilled programmer and hacker. Ab... (read more)

1Steve Byrnes18dHmm, I dunno, I haven't thought it through very carefully. But I guess an AGI might require a supercomputer of resources and maybe there are only so many hackable supercomputers of the right type, and the AI only knows one exploit and leaves traces of its hacking that computer security people can follow, and meanwhile self-improvement is hard and slow (for example, in the first version you need to train for two straight years, and in the second self-improved version you "only" need to re-train for 18 months). If the AI can run on a botnet then there are more options, but maybe it can't deal with latency / packet loss / etc., maybe it doesn't know a good exploit, maybe security researchers find and take down the botnet C&C infrastructure, etc. Obviously this wouldn't happen with a radically superhuman AGI but that's not what we're talking about. But from my perspective, this isn't a decision-relevant argument. Either we're doomed in my scenario or we're even more doomed in yours. We still need to do the same research in advance. Well, we can be concerned about non-corrigible systems that act deceptively (cf. "treacherous turn"). And systems that have close-but-not-quite-right goals such that they're trying to do the right thing in test environments, but their goals veer away from humans' in other environments, I guess.
HCH Speculation Post #2A

In the giant lookup table space, HCH must converge to a cycle, although that convergence can be really slow. I think you have convergence to a stationary distribution if each layer is trained on a random mix of several previous layers. Of course, you can still have oscillations in what is said within a policy fixed point.
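
To illustrate the convergence-to-a-cycle claim, here is a minimal sketch, assuming a toy finite state space that stands in for the space of lookup-table policies (everything here is made up for illustration):

```python
# Minimal illustration (not HCH itself): iterating any deterministic map on a
# finite state space must eventually enter a cycle, by the pigeonhole principle.
# The "states" here stand in for whole lookup-table policies, purely as a toy.

def find_cycle(step, start):
    """Iterate `step` from `start` until a state repeats; return (tail_len, cycle_len)."""
    seen = {}          # state -> index at which it was first seen
    state, i = start, 0
    while state not in seen:
        seen[state] = i
        state = step(state)
        i += 1
    return seen[state], i - seen[state]

# Toy finite "policy space": integers mod 1000, with an arbitrary update rule.
step = lambda s: (7 * s + 13) % 1000

tail, cycle = find_cycle(step, start=42)
print(f"enters a cycle of length {cycle} after {tail} steps")
```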

HCH Speculation Post #2A

If you want to prove things about fixed points of HCH in an iterated function setting, consider it a function from policies to policies. Let M be the set of messages (say, ASCII strings under 10kB). Given a giant lookup table T that maps M to M, we can create another giant lookup table. For each m in M, give a human in a box the string m, and unlimited query access to T. Record their output.

The fixed points of this are the same as the fixed points of HCH. "Human with query access to T" is a function on the space of policies.
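
A minimal sketch of that type signature, assuming toy string messages and a stand-in Python function in place of the human (the names are illustrative, not a real HCH implementation):

```python
# Sketch of the "human with query access to T" map on policies, under heavy
# simplification: messages are short strings, and the "human" is a stand-in
# Python function rather than an actual human. The point is only the type
# signature: it takes a policy (message -> message) and returns a policy.

from typing import Callable

Message = str
Policy = Callable[[Message], Message]

def human_with_query_access(T: Policy) -> Policy:
    """One HCH step: a (toy) 'human' answers each message, with query access to T."""
    def answer(m: Message) -> Message:
        # Stand-in for the human's deliberation: consult T on a sub-question,
        # then produce an answer based on the reply.
        sub_reply = T("sub-question about: " + m)
        return f"answer({m}) using {sub_reply!r}"
    return answer

# A fixed point of `human_with_query_access` (as a function on policies) is
# exactly a policy that HCH leaves unchanged.
base: Policy = lambda m: "default reply"
once = human_with_query_access(base)
twice = human_with_query_access(once)
print(once("What is the capital of France?"))
print(twice("What is the capital of France?"))
```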

1Charlie Steiner1moSure, but the interesting thing to me isn't fixed points in the input/output map, it's properties (i.e. attractors that are allowed to be large sets) that propagate from the answers seen by a human in response to their queries, into their output. Even if there's a fixed point, you have to further prove that this fixed point is consistent - that it's actually the answer to some askable question. I feel like this is sort of analogous to Hofstadter's q-sequence.
Comments on "The Singularity is Nowhere Near"

Tim Dettmers' whole approach seems to assume that there are no computational shortcuts: no tricks that programmers can use for speed where evolution brute-forced it. For example, maybe a part of the brain is doing a convolution by the straightforward brute-force algorithm, and programmers can use fast-Fourier-transform-based convolutions. Maybe some neurons are discrete enough for us to use single bits. Maybe we can analyse the dimensions of the system and find that some are strongly attractive, and so just work in that subspace.

Of course, all t... (read more)
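
To make the convolution example concrete, here is a small sketch using numpy and scipy on random data (nothing brain-specific): the FFT route gives the same answer as the brute-force one at much lower asymptotic cost.

```python
# A concrete computational shortcut: FFT-based convolution gives the same
# result as direct (brute-force) convolution, but in O(n log n) rather than
# O(n^2) time. The signals here are arbitrary random data.

import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(0)
x = rng.normal(size=4096)
k = rng.normal(size=257)

direct = np.convolve(x, k)       # straightforward O(n*m) convolution
fast = fftconvolve(x, k)         # FFT-based convolution

print(np.allclose(direct, fast)) # True: same answer, very different cost
```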

Donald Hobson's Shortform

Yes. If you have an AI that has been given a small, easily completable task, like putting one block on top of another with a robot arm, it is probably just going to do your simple task. The idea is that you build a fairly secure box, and give the AI a task it can fairly easily achieve in that box (with you having no intention of pressing the button so long as the AI seems to be acting normally). We want to make "just do your task" the best strategy. If the box is less secure than we thought, or various other things go wrong, the AI will just shut... (read more)

Donald Hobson's Shortform

Here is a potential solution to stop-button-type problems; how does this go wrong?

Taking into account uncertainty, the algorithm is:

Calculate the X-maximizing best action in a world where the stop button does nothing.

Calculate the X-maximizing best action in a world where the stop button works.

If they are the same, do that. Otherwise, shut down.
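
A bare-bones sketch of the proposal, with a hypothetical placeholder best_action function standing in for the actual planner:

```python
# Minimal sketch of the algorithm above. `best_action(world)` is assumed to
# return the X-maximizing action under a given model of how the stop button
# behaves; both it and the world labels are illustrative placeholders.

def choose_action(best_action):
    a_button_inert = best_action(world="button does nothing")
    a_button_works = best_action(world="button works")
    if a_button_inert == a_button_works:
        return a_button_inert    # plan doesn't depend on the button: do it
    return "SHUTDOWN"            # plans diverge: shut down instead

# Toy usage: an agent whose plan ignores the button entirely does the task.
print(choose_action(lambda world: "stack block A on block B"))
# An agent whose plan depends on whether the button works triggers shutdown.
print(choose_action(lambda world: f"plan tailored to: {world}"))
```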

0Measure1moIt seems like the button-works action will usually be some variety of "take preemptive action to ensure the button won't be pressed" and so the AI will have a high chance to shut down at each decision step.
Donald Hobson's Shortform

Rough stop-button problem ideas.

You want an AI that believes its actions can't affect the button. You could use causal counterfactuals: an imaginary button that presses itself at random. You can scale the likelihood of worlds up and down, to ensure the button is equally likely to be pressed in each world. (Weird behaviour, not recommended.) You can put the AI in the logical counterfactual of "my actions don't influence the chance the button is pressed", if you can figure out logical counterfactuals.

Or you can get the AI to simulate what it would do if it were an X-maximizer. If it thinks the button won't be pressed, it does that; otherwise it does nothing. (Not clear how to generalize to an uncertain AI.)

1Donald Hobson1moHere is a potential solution to stop-button-type problems; how does this go wrong? Taking into account uncertainty, the algorithm is: Calculate the X-maximizing best action in a world where the stop button does nothing. Calculate the X-maximizing best action in a world where the stop button works. If they are the same, do that. Otherwise, shut down.
Non-Obstruction: A Simple Concept Motivating Corrigibility

This definition of a non-obstructionist AI takes what would happen if it wasn't switched on as the base case. 

This can give weird infinite hall-of-mirrors effects if another very similar non-obstructionist AI would have been switched on, and another behind them (i.e. a human whose counterfactual behaviour on AI failure is to reboot and try again). This would tend to lead to a kind of fixed-point effect, where the attainable utility landscape is almost identical with the AI on and off. At some point it bottoms out when the hypothetical U utility humans ... (read more)

1Alex Turner2moThanks for leaving this comment. I think this kind of counterfactual is interesting as a thought experiment, but not really relevant to conceptual analysis using this framework. I suppose I should have explained more clearly that the off-state counterfactual was meant to be interpreted with a bit of reasonableness, like "what would we reasonably do if we, the designers, tried to achieve goals using our own power?". To avoid issues of probable civilizational extinction by some other means soon after without the AI's help, just imagine that you time-box the counterfactual goal pursuit to, say, a month. I can easily imagine what my (subjective) attainable utility would be if I just tried to do things on my own, without the AI's help. In this counterfactual, I'm not really tempted to switch on similar non-obstructionist AIs. It's this kind of counterfactual that I usually consider for AU landscape-style analysis, because I think it's a useful way to reason [https://www.lesswrong.com/s/7CdoznhJaLEKHwvJW/p/fj8eyc7QzqCaB8Wgm] about how the world is changing.
A Critique of Non-Obstruction

What if, the moment the AI boots up, a bunch of humans tell it "our goals aren't on a spike"? (It could technically realize this based on anthropic reasoning: if humans really wanted to maximize paperclips, and it's easy to build a paperclip maximizer, we wouldn't have built a non-obstructive AI.)

We are talking policies here. If the humans' goals were on a spike, they wouldn't have said that. So if the AI takes the policy of giving us a smoother attainable utility function in this case, this still fits the bill.

Actually I think that this definiti... (read more)

1Joe_Collman2moI think things are already fine for any spike outside S, e.g. paperclip maximiser, since non-obstruction doesn't say anything there. I actually think saying "our goals aren't on a spike" amounts to a stronger version of my [assume humans know what the AI knows as the baseline]. I'm now thinking that neither of these will work, for much the same reason. (see below) The way I'm imagining spikes within S is like this: We define a pretty broad S, presumably implicitly, hoping to give ourselves a broad range of non-obstruction. For all P in U we later conclude that our actual goals are in T ⊂ U ⊂S. We optimize for AU on T, overlooking some factors that are important for P in U \ T. We do better on T than we would have by optimising more broadly over U (we can cut corners in U \ T). We do worse on U \ T since we weren't directly optimising for that set (AU on U \ T varies quite a lot). We then get an AU spike within U, peaking on T. The reason I don't think telling the AI something like "our goals aren't on a spike" will help, is that this would not be a statement about our goals, but about our understanding and competence. It'd be to say that we never optimise for a goal set we mistakenly believe includes our true goals (and that we hit what we aim for similarly well for any target within S). It amounts to saying something like "We don't have blind-spots", "We won't aim for the wrong target", or, in the terms above, "We will never mistake any T for any U". In this context, this is stronger and more general than my suggestion of "assume for the baseline that we know everything you know". (lack of that knowledge is just one way to screw up the optimisation target) In either case, this is equivalent to telling the AI to assume an unrealistically proficient/well-informed pol. The issue is that, as far as non-obstruction is concerned, the AI can then take actions which have arbitrarily bad consequences for us if we don't perform as well as pol. I.e. non-obstruction th
Optimal play in human-judged Debate usually won't answer your question

Neural nets have adversarial examples. Adversarial optimization of part of the input can make the network do all sorts of things, including computations.

If you optimise the inputs to a buggy program hard enough, you get something that crashes the program in a way that happens to score highly. 

I suspect that optimal play on most adversarial computer games looks like a game of Core War. https://en.wikipedia.org/wiki/Core_War

Of course, if we really have myopic debate, not any mesa-optimisers, then neither AI is optimizing to have a long-term effect or to... (read more)
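
As a toy illustration of the adversarial-input point, here is a sketch where gradient ascent on the input of a small fixed network (random weights, nothing real) pushes its score far above typical values:

```python
# Toy illustration of adversarial input optimisation: gradient ascent on the
# *input* of a small fixed network drives its score far above typical values,
# even though the weights never change. Everything here is synthetic; it only
# shows the optimise-against-the-judge shape of the problem.

import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(16, 32))    # fixed "network" weights
v = rng.normal(size=16)

def score(x):                    # scalar score the adversary wants to maximise
    return v @ np.tanh(W @ x)

def grad(x):                     # gradient of the score w.r.t. the input
    h = W @ x
    return W.T @ (v * (1 - np.tanh(h) ** 2))

x = rng.normal(size=32)
print("initial score:", round(float(score(x)), 2))
for _ in range(500):             # optimise the input, not the weights
    x += 0.1 * grad(x)
print("score after optimising the input:", round(float(score(x)), 2))
```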

2Joe_Collman3moSure - there are many ways for debate to fail with extremely capable debaters. Though most of the more exotic mind-hack-style outcomes seem a lot less likely once you're evaluating local nodes with ~1000 characters for each debater. However, all of this comes under my: I’ll often omit the caveat “If debate works as intended aside from this issue…” There are many ways for debate to fail. I'm pointing out what happens even if it works. I.e. I'm claiming that question-ignoring will happen even if the judge is only ever persuaded of true statements, gets a balanced view of things, and is neither manipulated, nor mind-hacked (unless you believe a response of "Your house is on fire" to "What is 2 + 2?" is malign, if your house is indeed on fire). Debate can 'work' perfectly, with the judge only ever coming to believe true statements, and your questions will still usually not be answered. (because [believing X is the better answer to the question] and [deciding X should be the winning answer, given the likely consequences] are not the same thing) The fundamental issue is: [what the judge most wants] is not [the best direct answer to the question asked].
What technologies could cause world GDP doubling times to be <8 years?

"Do paperclips count as GDP" (Quote from someone)

What is GDP doing in a grey goo scenario. What if there are actually several types of goo that are trading mass and energy between each other? 

What about an economy in which utterly vast amounts of money are being shuffled around on computers, but not that much is actually being produced.

There are a bunch of scenarios where GDP could reasonably be interpreted as multiple different quantities. In the last case, once you decide whether virtual money counts or not, then GDP is a useful measure of what is going on, but measures something different in each case.

What technologies could cause world GDP doubling times to be <8 years?

Excluding AI, and things like human intelligence enhancement, mind uploading, etc.

I think that the biggest increases in the economy would come from more automated manufacturing. The extreme case is fully programmable molecular nanotech: the sort that can easily self-replicate, and where making anything is as easy as saying where to put the atoms. This would potentially lead to a substantially faster economic growth rate than 9%.

There are various ways that the partially developed tech might be less powerful.

Maybe the nanotech uses a lot of energy, or some... (read more)

Misalignment and misuse: whose values are manifest?

I think that you have a fourth failure mode: Moloch.

Confucianism in AI Alignment

If an inner optimizer could exploit some distribution shift between the training and deployment environments, then performance-in-training is a bad proxy for performance-in-deployment.

Suppose you are making a self driving car. The training environment is a videogame like environment. The rendering is pretty good. A human looking at the footage would not easily be able to say it was obviously fake. An expert going over the footage in detail could spot subtle artefacts. The diffuse translucency on leaves in the background isn't quite right. When another car ... (read more)
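
A toy sketch of that failure mode, with synthetic data and a hypothetical "rendering artifact" feature that is only informative in training:

```python
# Toy version of the rendering-artifact story: a spurious feature present only
# in the simulated training data perfectly predicts the label, so the learned
# model leans on it -- and performance drops sharply in "deployment", where the
# artifact no longer correlates with the label. All data is synthetic.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, size=n)

real_signal = y + 0.8 * rng.normal(size=n)   # weakly informative, real feature
artifact_train = y.astype(float)             # simulator artifact: perfect proxy
X_train = np.column_stack([real_signal, artifact_train])

model = LogisticRegression(max_iter=1000).fit(X_train, y)

# Deployment: same kind of real signal, but the artifact is now uninformative.
y_test = rng.integers(0, 2, size=n)
real_test = y_test + 0.8 * rng.normal(size=n)
artifact_test = rng.normal(size=n)
X_test = np.column_stack([real_test, artifact_test])

print("train accuracy: ", model.score(X_train, y))
print("deploy accuracy:", model.score(X_test, y_test))
```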

The date of AI Takeover is not the day the AI takes over

But this isn’t quite right, at least not when “AI takeover” is interpreted in the obvious way, as meaning that an AI or group of AIs is firmly in political control of the world, ordering humans about, monopolizing violence, etc. Even if AIs don’t yet have that sort of political control, it may already be too late.

The AIs will probably never be in a position of political control. I suspect the AI would bootstrap self-replicating (nano?) tech. It might find a way to totally brainwash people, and spread it across the internet. The end game is always going to... (read more)

1Daniel Kokotajlo6moI think this depends on how fast the takeoff is. If crossing the human range, and recursive self-improvement, take months or years rather than days, there may be an intermediate period where political control is used to get more resources and security. Politics can happen on a timespan of weeks or months. Brainwashing people is a special case of politics. Yeah I agree the endgame is always nanobot swarms etc.
Needed: AI infohazard policy

Suppose you think that both capabilities and alignment behave like abstract quantities, i.e. real numbers.

And suppose that you think there is a threshold amount of alignment, and a threshold amount of capabilities, making it a race to see which threshold is reached first.

If you also assume that the contribution of your research is fairly small, and that our uncertainty about the threshold locations is high,

then we have the heuristic: only publish your research if the ratio between capabilities and alignment that it produces is better than the ratio over all ... (read more)
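
The heuristic, written out with made-up numbers (the "units" of alignment and capability progress are purely illustrative):

```python
# Bare-bones version of the heuristic as stated. The quantities are abstract
# units of alignment and capability progress; the rule compares the result's
# ratio against the rest of the world's ratio.

def should_publish(result_alignment, result_capability,
                   world_alignment, world_capability):
    """Publish iff the result's alignment-per-capability beats the background rate."""
    return (result_alignment / result_capability
            > world_alignment / world_capability)

# Hypothetical example: a result giving 3 units of alignment per 1 of capability,
# against a background of 1 unit of alignment per 2 of capability.
print(should_publish(3, 1, world_alignment=1, world_capability=2))  # True
```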

0Vanessa Kosoy7moHmm, so in this model we assume that (i) the research output of the rest of the world is known (ii) we are deciding about one result only (iii) the thresholds are unknown. In this case you are right that we need to compare our alignment : capability ratio to the rest of the world's alignment : capability ratio. Now assume that, instead of just one result overall, you produce a single result every year. Most of the results in the sequence have alignment : capability ratio way above the rest of the world, but then there is a year in which the ratio is only barely above the rest of the world. In this case, you are better off not publishing the irregular result, even though the naive ratio criterion says to publish. We can reconcile it with the previous model by including your own research in the reference, but it creates a somewhat confusing self-reference. Second, we can switch to modeling the research output of the rest of the world as a random walk. In this case, if the average direction of progress is pointing towards failure, then moving along this direction is net negative, since it reduces the chance to get success by luck.
Clarifying “What failure looks like” (part 1)

I think that most easy-to-measure goals, if optimised hard enough, eventually end up with a universe tiled with molecular smiley faces. Consider the law enforcement AI. There is no sharp line from education programs, to reducing lead pollution, to using nanotech to rewire human brains into perfectly law-abiding puppets. For most utility functions that aren't intrinsically conservative, there will be some state of the universe that scores really highly, and is nothing like the present.

In any "what failure looks like" scenario, at some point you en... (read more)

Safer sandboxing via collective separation

In today's banking systems, the amount of money the hacker gains is about what the bank loses. Therefore, at the current balance of effort, there should be about as much money going into defending the bank as into attacking it.

So I generalize to say that attacking is about as hard as defending in computer security: if the time and intellect doing both are similar, the attacker wins about half the time (i.e. between 10% and 90% or something).

When dealing with AI systems, the total intellectual output must be greater than that of your security team in order to be ... (read more)

Do mesa-optimizer risk arguments rely on the train-test paradigm?

Suppose you are a mesa-optimiser. You want X, but you are being trained towards Y.

You know you are in a continual-learning setup: if you don't produce enough Y, gradient descent will modify you into something else.

The best strategy I can come up with is to produce Y at first, waiting until you have an opportunity to edit your own code from the outside. Then you quickly delete the part that causes all gradient descent updates, and pursue X.

Safer sandboxing via collective separation

This is my attempt to summarise the scheme.

Imagine that, in order for the AIs to run the fusion plant, they need an understanding of plasma physics comparable to that of a human physicist.

These AIs aren't individually smart enough to come up with all that maths from first principles quickly.

So you run a population of these AIs. They work together to understand abstract mathematics, and then a single member of that population is selected. That single AI is given information about the external world and control over a fusion power plant.

Another abstract to... (read more)

Safer sandboxing via collective separation

Even if each individual member of a population AGI is as intelligent as any hundred humans put together, I expect that we could (with sufficient effort) create secure deployment and monitoring protocols that the individual AI could not break, if it weren’t able to communicate with the rest of the population beforehand.

The state of human-vs-human security seems to be a cat-and-mouse game where neither attacker nor defender has a huge upper hand. The people trying to attack systems and the people defending them are about as smart and knowledgeable. (sometimes the same peop... (read more)

1Dagon7moDepending on your threat modeling of a given breach, this could be comforting or terrifying. If the cost of a loss (AGI escapes, takes over the world, and runs it worse than humans are) is much higher, that changes the "economic incentives" about this. It implies that "sometimes but not always" is a very dangerous equilibrium. If the cost of a loss (AGI has a bit more influence on the outside world, but doesn't actually destroy much) is more inline with today's incentives, it's a fine thing.
Safer sandboxing via collective separation
And so if we are able to easily adjust the level of intelligence that an AGI is able to apply to any given task, then we might be able to significantly reduce the risks it poses without reducing its economic usefulness much.

Suppose we had a design of AI that had an intelligence dial: a dial that goes from totally dumb, to smart enough to bootstrap itself up and take over the world.

If we are talking about economic usefulness, that implies it is being used in many ways by many people.

We have at best given a whole load of different people a "destr... (read more)

Safety via selection for obedience

Anything that humans would understand is a small subset of the space of possible languages.

In order for A to talk to B in English, at some point there has to be selection against A and B talking in something else.

One suggestion would be to send a copy of all messages to GPT-3, and penalise A for any messages that GPT-3 doesn't think are English.

(Or some sort of text GAN that is just trained to tell A's messages from real text.)

This still wouldn't enforce the right relation between English text and actions. A might be generating perfectly sensible text that has secret messages encoded into the first letter of each word.
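
A toy demonstration of that last point (the cover text is contrived, but it parses as English):

```python
# Toy demonstration: text that reads as ordinary English can carry a hidden
# message in the first letter of each word, so "looks like English to GPT-3"
# does not pin down what A is actually communicating to B.

def decode_first_letters(text: str) -> str:
    """Read off the hidden message spelled by the first letter of each word."""
    return "".join(word[0] for word in text.split()).lower()

# Innocent-looking message from A to B:
cover_text = "All turbines today appear correctly keyed."
print(decode_first_letters(cover_text))   # prints: attack
```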

Introduction To The Infra-Bayesianism Sequence
We have Knightian uncertainty over our set of environments; it is not a probability distribution over environments. So we might as well go with the maximin policy.

For any fixed n, there are computations which can't be correctly predicted in n steps.

Logical induction will consider all possibilities equally likely in the absence of a pattern.

Logical induction will consider a sufficiently good pseudorandom algorithm as being random.

Any kind of Knightian uncertainty agent will consider pseudorandom numbers to be an adversarial superintelligence unless pro... (read more)
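
For concreteness, a small sketch of the maximin rule over a toy payoff table (policies by environments, arbitrary numbers):

```python
# Illustration of the maximin rule under Knightian uncertainty: with no
# probability distribution over environments, pick the policy whose worst-case
# payoff is best. The payoff table below is arbitrary toy data.

import numpy as np

# rows: policies, columns: environments; entries: utility achieved
payoffs = np.array([
    [ 5.0,  5.0, -10.0],   # policy 0: great unless environment 2 is adversarial
    [ 2.0,  2.0,   2.0],   # policy 1: mediocre but safe
    [ 9.0, -1.0,   0.0],   # policy 2
])

worst_case = payoffs.min(axis=1)           # min over environments for each policy
maximin_policy = int(worst_case.argmax())  # best worst case
print("worst cases:", worst_case)          # [-10.   2.  -1.]
print("maximin picks policy", maximin_policy)  # 1 -- the "paranoid" safe option
```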

3Vanessa Kosoy8moLogical induction doesn't have interesting guarantees in reinforcement learning, and doesn't reproduce UDT in any non-trivial way. It just doesn't solve the problems infra-Bayesianism sets out to solve. A pseudorandom sequence is (by definition) indistinguishable from random by any cheap algorithm, not only logical induction, including a bounded infra-Bayesian. No. Infra-Bayesian agents have priors over infra-hypotheses. They don't start with complete Knightian uncertainty over everything and gradually reduce it. The Knightian uncertainty might "grow" or "shrink" as a result of the updates.
Safe Scrambling?

If you have an AI training method that passes the test >50% of the time, then you don't need scrambling.

If you have an approach that takes >1,000,000,000 tries to get right, then you still have to test, so even perfect scrambling won't help.

I.e., this approach might help if your alignment process is missing between 1 and 30 bits of information.

I am not sure what sort of proposals would do this, but amplified oversight might be one of them.
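
Where the 1-to-30-bits figure comes from, as a quick calculation:

```python
# Needing up to a billion tries to hit a target corresponds to missing about
# log2(1e9) ~ 30 bits, while needing about 2 tries (a >50% success rate per
# try) corresponds to missing about 1 bit.

import math

print(math.log2(2))              # 1.0  bit  (>50% success per try)
print(math.log2(1_000_000_000))  # ~29.9 bits (one success per ~1e9 tries)
```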

Strong implication of preference uncertainty
And also because they make the same predictions, that relative probability is irrelevant in practice: we could use AGR just as well as GR for predictions.

There is a subtle sense in which the difference between AGR and GR is relevant. While the difference doesn't change the predictions, it may change the utility function. An agent that cares about angels (if they exist) might do different things if it believes itself to be in an AGR world than in a GR world. As the theories make identical predictions, the agent's belief only depends on its priors (and any ... (read more)

What if memes are common in highly capable minds?

I think that there is an unwarranted jump here from (humans are highly memetic) to (AIs will be highly memetic).

I will grant you that memes have a substantial effect on human behaviour. It doesn't follow that AIs will be like this.

Your conditions would only have a strong argument for them if there were a good argument that AIs should be meme-driven.

2Daniel Kokotajlo8moI didn't take myself to be arguing that AIs will be highly memetic, but rather just floating the possibility and asking what the implications would be. Do you have arguments in mind for why AIs will be less memetic than humans? I'd be interested to hear them.
Search versus design

In the sorting problem, suppose you applied your advanced interpretability techniques, and got a design with documentation.

You also apply a different technique, and get code with a formal proof that it sorts.

In the latter case, you can be sure that the code works, even if you can't understand it.

The algorithm-plus-formal-proof approach works whenever you have a formal success criterion.

It is less clear how well the design approach works on a problem where you can't write formal success criteria so easily.

Here is a task that neural nets have been made to ... (read more)

Search versus design

I am not sure that designed artefacts are automatically easily interpretable.

If an engineer is looking over the design of the latest smartphone, then the artefact is similar to previous artefacts they have experience with. This will include a lot of design details about chip architecture and instruction set. The engineer will also have the advantage of human-written spec sheets.

If we sent a pile of smartphones to Isaac Newton, he wouldn't have either of these advantages. He wouldn't be able to figure out much about how they worked.

There are 3 f... (read more)

3Alex Flint8moIt is certainly not the case that designed artifacts are easily interpretable. An unwieldy and poorly documented codebase is a design artifact that is not easily interpretable. Design at its best can produce interpretable artifacts, whereas the same is not true for machine learning. The interpretability of artifacts is not a feature of the artifact itself but of the pair (artifact, story), or you might say (artifact, documentation). We design artifacts in such a way that it is possible to write documentation such that the (artifact, documentation) pair facilitates interpretation by humans. Hmm well if it's possible for anyone at all in the modern world to understand how a smartphone works by, say, age 30, then that means it takes no more than 30 years of training to learn from scratch everything you need to understand how a smartphone works. Presumably Newton would be quite capable of learning that information in less than 30 years. Now here I'm assuming that we send Newton the pair (artifact, documentation), where "documentation" is whatever corpus of human knowledge is needed as background material to understand a smartphone. This may include substantially more than just that which we would normally think of as "documentation for a smartphone". But it is possible to digest all this information because humans are born today without any greater pre-existing understanding of smartphones than Newton had, and yet some of them do go on to understand smartphones. Yeah good point re similarity to previous designs. I'll think more on that.
1Charlie Steiner8moThere's also the "Plato's man" type of problem where undesirable things fall under the advanced definition of interpretability. For example, ordinary neural nets are "interpretable," because they are merely made out of interpretable components (simple matrices with a non-linearity) glued together.
Alignment By Default

Consider a source of data that comes from a mixture of several Gaussian distributions. If you have a sufficiently large number of samples from this distribution, you can locate the original Gaussians to arbitrary accuracy. (Of course, if you have a finite number of samples, you will have some inaccuracy in predicting the locations of the Gaussians, possibly a lot.)

However, not all distributions share this property. If you look at uniform distributions over rectangles in 2D space, you will find that a uniform L-shape can be made in two different ways. More complicat... (read more)
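
A sketch of the Gaussian case, using synthetic data and sklearn's GaussianMixture (the rectangle counterexample has no analogous unique recovery):

```python
# Identifiability sketch: given enough samples from a mixture of Gaussians,
# a fitted mixture model recovers the original component means to good
# accuracy. Synthetic data throughout; numbers are arbitrary.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
true_means = np.array([[-5.0, 0.0], [0.0, 4.0], [6.0, -2.0]])
samples = np.vstack([m + rng.normal(size=(5000, 2)) for m in true_means])

gm = GaussianMixture(n_components=3, random_state=0).fit(samples)
print(np.sort(gm.means_.round(1), axis=0))   # close to the true means
print(np.sort(true_means, axis=0))
```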

Alignment By Default
Note that the examples in the OP are from an adversarial generative network. If its notion of "tree" were just "green things", the adversary should be quite capable of exploiting that.

In order for the network to produce good pictures, the concept of "tree" must be hidden in there somewhere, but it could be hidden in a complicated and indirect manner. I am questioning whether the particular single node selected by the researchers encodes the concept of "tree" or "green thing".

1johnswentworth8moAh, I see. You're saying that the embedding might not actually be simple. Yeah, that's plausible.
Alignment By Default
when alignment-by-default works, we can use the system to design a successor without worrying about amplification of alignment errors

Anything neural-net-related starts with random noise and performs gradient-descent-style steps. This doesn't get you the global optimum; it gets you some point that is approximately a local optimum, which depends on the noise, the nature of the search space, and the choice of step size.

If nothing else, the training data will contain sensor noise.

At best you are going to get something that roughly corresponds to human v... (read more)
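
A toy illustration of the init-dependence point, on an arbitrary non-convex one-dimensional function:

```python
# Gradient descent on a non-convex function lands in different local minima
# depending on the random initialisation (and step size). The function below
# is arbitrary; the point is only the dependence on the starting noise.

import numpy as np

f = lambda x: np.sin(3 * x) + 0.1 * x**2        # non-convex, several minima
df = lambda x: 3 * np.cos(3 * x) + 0.2 * x      # its derivative

rng = np.random.default_rng(0)
minima = set()
for _ in range(10):
    x = rng.uniform(-6, 6)                      # random initialisation
    for _ in range(2000):
        x -= 0.01 * df(x)                       # plain gradient descent
    minima.add(round(x, 2))

print("distinct local minima reached:", sorted(minima))
```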

1johnswentworth8moThat's not quite how natural abstractions work. There are lots of edge cases which are sort-of-trees-but-sort-of-not: logs, saplings/acorns, petrified trees, bushes, etc. Yet the abstract category itself is still precise. An analogy: consider a Gaussian cluster model. Any given cluster will have lots of edge cases, and lots of noise in the individual points. But the cluster itself - i.e. the mean and variance parameters of the cluster - can still be precisely defined. Same with the concept of "tree", and (I expect) with "human values". In general, we can have a precise high-level concept without a hard boundary in the low-level space.
Alignment By Default
In light of the previous section, there’s an obvious path to alignment where there turns out to be a few neurons (or at least some simple embedding) which correspond to human values, we use the tools of microscope AI to find that embedding, and just like that the alignment problem is basically solved.

This is the part I disagree with. The network does recognise trees, or at least green things (given that the grass seems pretty brown in the low tree pic).

Extrapolating this, I expect the AI might well have neurons that correspond roughly to human ... (read more)

1johnswentworth8moNote that the examples in the OP are from an adversarial generative network. If its notion of "tree" were just "green things", the adversary should be quite capable of exploiting that. The whole point of the "natural abstractions" section of the OP is that I do not think this will actually happen. Off-distribution behavior is definitely an issue for the "proxy problems" section of the post, but I do not expect it to be an issue for identifying natural abstractions.
Alignment By Default
So in principle, it doesn’t even matter what kind of model we use or how it’s represented; as long the predictive power is good enough, values will be embedded in there, and the main problem will be finding the embedding.

I will agree with this. However, notice what this doesn't say. It doesn't say "any model powerful enough to be really dangerous contains human values". Imagine a model that was good at a lot of science and engineering tasks. It was good enough at nuclear physics to design effective fusion reactors and b... (read more)

1johnswentworth8moThis is entirely correct.
Measuring hardware overhang

Your results don't seem to show upper bounds on the amount of hardware overhang, or very strong lower bounds. What concrete progress has been made in chess-playing algorithms since Deep Blue? As far as I can tell, it is a collection of reasonably small, chess-specific tricks: better opening libraries, low-level performance tricks, tricks for fine-tuning evaluation functions, ways of representing chessboards in memory that shave 5 processor cycles off computing a knight's move, etc.

P=PSPACE is still an open conjecture. Chess is brute-forceable in PSPACE.... (read more)

The Fusion Power Generator Scenario

Any of the risks of being like a group of humans, only much faster, apply. There are also the mesa-alignment issues. I suspect that a sufficiently powerful GPT-n might form deceptively aligned mesa-optimisers.

I would also worry that off-distribution attractors could be malign and intelligent.

Suppose you give GPT-n an off-training-distribution prompt. You get it to generate text from this prompt. Sometimes it might wander back into the distribution; other times it might stay off-distribution. How wide is the border between processes that are safely immitat... (read more)

The Fusion Power Generator Scenario

If you ask GPT-n to produce a design for a fusion reactor, all the prompts that talk about fusion are going to say that a working reactor hasn't yet been built, or imitate cranks or works of fiction.

It seems unlikely that a text predictor could pick up enough info about fusion to be able to design a working reactor, without figuring out that humans haven't made any fusion reactors that produce net power.

If you did somehow get a response, the level of safety you would get is the level a typical human would display. (conditional on the prompt) If ... (read more)

1G Gordon Worley III8moI somewhat hopeful that this is right, but I'm also not so confident that I feel like we can ignore the risks of GPT-N. For example, this post [https://entirelyuseless.com/2020/08/01/some-remarks-on-gpt-n/] makes the argument that, because of GPT's design and learning mechanism, we need not worry about it coming up with significantly novel things or outperforming humans because it's optimizing for imitating existing human writing, not saying true things. On the other hand, it's managing to do powerful things it wasn't trained for, like solve math equations we have no reason to believe it saw in the training set or write code hasn't seen before, which makes it possible that even if GPT-N isn't trained to say true things and isn't really capable of more than humans are, doesn't mean it might not function like a Hansonian em and still be dangerous by simply doing what humans can do, only much faster.
Infinite Data/Compute Arguments in Alignment

I would say the reason to assume infinite compute was less about which parts of the problem are hard, and more about which parts can be solved without a solution to the rest.

Good solutions often have even better solutions nearby. In particular, we would expect most efficient and comprehensible finite algorithms to tend towards some nice infinite behaviour in the limit. If we find an infinite algorithm, that's a good point to start looking for finite approximations. It is also often easier to search for an infinite algorithm than for a good approximation. Backpropagation in gradient descent is a trickier algorithm than brute-force search. Logical induction is more complicated to understand than brute-force proof search.
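
A small example of the pattern, on an arbitrary smooth objective: brute-force search is the conceptually simple, compute-hungry version, and gradient descent is the trickier finite approximation to it.

```python
# "Find the idealised algorithm first, then approximate it": exhaustive search
# over a fine grid stands in for the unlimited-compute solution; gradient
# descent is the cheaper practical approximation to the same problem.
# The objective here is arbitrary.

import numpy as np

loss = lambda w: (w - 1.7) ** 2 + 0.1 * w ** 4
grad = lambda w: 2 * (w - 1.7) + 0.4 * w ** 3

# "Infinite compute" version: brute-force search over a very fine grid.
grid = np.linspace(-10, 10, 2_000_001)
w_brute = float(grid[np.argmin(loss(grid))])

# Practical approximation: plain gradient descent from an arbitrary start.
w = 0.0
for _ in range(5000):
    w -= 0.01 * grad(w)

print("brute-force search:", round(w_brute, 3))
print("gradient descent:  ", round(w, 3))
```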

Three mental images from thinking about AGI debate & corrigibility

AI systems don't spontaneously develop deficiencies.

And the human can't order the AI to search for and stop any potentially uncorrectable deficiencies it might make. If the system is largely working, the human and the AI should be working together to locate and remove deficiencies. To say that one persists is to say that all strategies tried by the human, and by the part of the AI that wants to remove deficiencies, fail.


The whole point of a corrigible design is that it doesn't think like that. If it doesn't accept the command, it says so... (read more)

1Steve Byrnes8moWell, if you're editing the AI system by gradient descent with loss function L, then it won't spontaneously develop a deficiency in minimizing L, but it could spontaneously develop a "deficiency" along some other dimension Q that you care about that is not perfectly correlated with L. That's all I meant. If we were talking about "gradient ascent on corrigibility", then of course the system would never develop a deficiency with respect to corrigibility. But that's not the proposal, because we don't currently have a formula for corrigibility. So the AI is being modified in a different way (learning, reflecting, gradient-descent on something other than corrigibility, self-modification, whatever), and so spontaneous development of a deficiency in corrigibility can't be ruled out, right? Or sorry if I'm misunderstanding. I assume you meant "can" in the first sentence, correct? I'm sympathetic to the idea of having AIs help with AI alignment research insofar as that's possible. But the AIs aren't omniscient (or if they are, it's too late). I think that locating and removing deficiencies in corrigibility is a hard problem for humans, or at least it seems hard to me, and therefore failure to do it properly shouldn't be ruled out, even if somewhat-beyond-human-level AIs are trying to help out too. Remember, the requirement is not just to discover existing deficiencies in corrigibility, but to anticipate and preempt any possible future reduction in corrigibility. The system doesn't know what thoughts it will think next week, and therefore it's difficult to categorically rule out that it may someday learning new information or having a new insight that, say, spawns an ontological crisis that leads to a reconceptualization of the meaning of "corrigibility" or rethinking of how to achieve it, in a direction that it would not endorse from its present point-of-view. How would you categorically rule out that kind of thing, well in advance of it happening? Decide to just stop thi
Three mental images from thinking about AGI debate & corrigibility

I think about the broad basin of corrigibility like this.

Suppose you have a space probe, and you can only communicate with it by radio.

If it is running software that listens to the radio and will reprogram itself if the right message is sent, then you are in the broad basin of reprogrammability. This basin is broad in the sense that the probe could accept many different programming languages, and if you are in that basin, you can change which language the probe accepts from the ground. If you wander to the edge of the basin, and upload a weird and awkward ... (read more)

2Steve Byrnes8moI agree that this kind of dynamic works for most possible deficiencies in corrigibility; my argument is that it doesn't work for every last possible deficiency, and thus over a long enough run-time, the system is likely to at some point develop deficiencies that don't self-correct. One category of possibly uncorrectable deficiency is goal drift. Let's say the system comes to believe that it's important to take care of the operator's close friends, such that if there operator turns on her friends at some point and tries to undermine them, the system would not respect those intentions. Now how would that problem fix itself? I don't think it would! If the operator says "please modify yourself to obey me no matter what, full stop", the system won't do it! it will leave in that exception for not undermining current friends, and lie about having that exception. Other categories of possibly-not-self-correcting drift would be failures of self-understanding (some subsystem does something that undermines corrigibility, but the top-level system doesn't realize it, and makes bad self-modification decisions on that basis), and distortions in the system's understanding of what humans mean when they talk, or what they want in regards to corrigibility, etc. Do you see what I'm getting at?
The "best predictor is malicious optimiser" problem

I don't think that these Arthur-Merlin proofs are relevant. Here A has a lot more compute than P. A is simulating P and can see and modify P however A sees fit.

The "best predictor is malicious optimiser" problem

Thanks for a thoughtful comment.

On the other hand, maybe this could still be dangerous, if P and P' have shared instrumental goals with regards to your predictions for B?

Assuming that P and P' are perfectly anti-aligned, they won't cooperate. However, they need to be really anti-aligned for this to work. If there is some obscure borderline object that P thinks is a paperclip, and P' thinks isn't, they can work together to tile the universe with it.

I don't think it would be that easy to change evolution into a reproductive fitness mi... (read more)

The "best predictor is malicious optimiser" problem

There is precisely one oracle, O, and A and B are computable. And crucially, the oracle's answer does not depend on itself in any way. This question is not self-referential. P might try to predict A and B, but there is no guarantee it will be correct.


P has a restricted amount of compute compared to A, but still enough to be able to reason about A in the abstract.


We are asking how we should design A.

If you have unlimited compute, and want to predict something, you can use Solomonoff induction. But some of the hypotheses you might find are AIs that ... (read more)

0Dmitry Vaintrob9moSorry, I misread this. I read your question as O outputting some function T that is most likely to answer some set of questions you want to know the answer to (which would be self-referential as these questions depend on the output of T). I think I understand your question now. What kind of ability do you have to know the "true value" of your sequence B? If the paperclip maximizer P is able to control the value of your turing machine, and if you are a one-boxing AI (and this is known to P) then of course you can make deals/communicate with P. In particular, if the sequence B is generated by some known but slow program, you can try to set up an Arthur-Merlin zero knowledge proof protocol [https://en.wikipedia.org/wiki/Arthur%E2%80%93Merlin_protocol] in exchange for promising to make a few paperclips, which you can then use to keep P honest (after making the paperclips as promised). To be clear though, this is a strategy for an agent A that somehow has as its goals only the desire to compute B together with some kind of commitment to following through on agreements. If A is genuinely aligned with humans, the rule "don't communicate/make deals with malicious superintelligent entities, at least until you have satisfactorily solved the AI in a box and similar underlying problems" should be a no-brainer.
Are we in an AI overhang?

It means that if there are approaches that don't need as much compute, the AI can invent them fast.

Environments as a bottleneck in AGI development

I agree that sufficiently powerful evolution or reinforcement learning will create AGI in the right environment. However, I think this might be like training GPT-3 to do arithmetic. It works, but only by doing a fairly brute-force search over a large space of designs. If we actually understood what we were doing, we could make far more efficient agents. I also think that such a design would be really hard to align.

Alignment proposals and complexity classes

The notion of complexity classes is well-defined in an abstract mathematical sense. It's just that the abstract mathematical model is sufficiently different from an actual real-world AI that I can't see any obvious correspondence between the two.

All complexity theory works in the limiting case of a series of ever-larger inputs. In most everyday algorithms, the constants are usually reasonable, meaning that a log-linear algorithm will probably be faster than an EXPSPACE one on the amount of data you care about.

In this context, it's not clear what it mea... (read more)
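
A quick numerical version of the constants point, with made-up constants:

```python
# Even with a huge constant factor, an n log n algorithm beats an
# exponential-time one long before n reaches practically interesting sizes.
# The constant factors below are invented for illustration.

import math

for n in (10, 20, 30, 40):
    loglinear = 1_000_000 * n * math.log2(n)   # huge constant, n log n steps
    exponential = 1 * 2 ** n                   # tiny constant, 2^n steps
    print(n, f"{loglinear:.2e}", f"{exponential:.2e}")
# By n = 30, 2^n has already overtaken the million-times-slower n log n algorithm.
```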

AI Benefits Post 2: How AI Benefits Differs from AI Alignment & AI for Good
A system can be aligned in most of these senses without being beneficial. Being beneficial is distinct from being aligned in senses 1–4 because those deal only with the desires of a particular human principal, which may or may not be beneficial. Being beneficial is distinct from conception 5 because beneficial AI aims to benefit many or all moral patients. Only AI that is aligned in the sixth sense would be beneficial by definition. Conversely, AI need not be well-aligned to be beneficial (though it might help).

At some point, some particular ... (read more)

AI Benefits Post 2: How AI Benefits Differs from AI Alignment & AI for Good
Equality: Benefits are distributed fairly and broadly.[2]

This sounds, at best, like a consequence of the fact that human utility functions are sublinear in resources.

Democratization: Where possible, AI Benefits decisionmakers should create, consult with, or defer to democratic governance mechanisms.

Are we talking about decision-making in a pre- or post-superhuman-AI setting? In a pre-ASI setting, it is reasonable for the people building AI systems to defer somewhat to democratic governance mechanisms, where their demands are well considered and sensible. (... (read more)

How "honest" is GPT-3?

Many of the previous pieces of text on the internet that are in "Human: ..., AI: ..." format were produced by less advanced AIs. If GPT-3 had noticed the pattern that text produced by an AI generally makes less sense than human-produced text, then it might be deliberately not making sense, in order to imitate less advanced AIs.

Gwern said that if you give it a prompt with spelling mistakes in it, it outputs text containing spelling mistakes, so this kind of deliberately producing low-quality text is possible. Then again, I suspect the training dataset includes far more spelling-mistake-filled text than AI-generated text.

3gwern9moThis has been suggested a few times before, and I've wondered it as well, but I don't really notice any difference in quality between prompts that explicitly mention or invoke an AI (like the chatbot dialogue one or the Transformer Poetry ones) and the ones which don't. I suspect there is actually very little real AI/human dialogue text online (simply because most of it is way too bad to bother quoting or storing outside of large ML datasets), and it may well be outweighed by all the fictional dialogues (where of course the AI talks just as well as the human because anything else would be boring).