I agree that in a fast takeoff scenario there's little reason for an AI system to operate within existing societal structures, as it can outgrow them quicker than society can adapt. I'm personally fairly skeptical of fast takeoff (<6 months, say) but quite worried that society may be slow enough to adapt that even years of gradual progress with a clear sign that transformative AI is on the horizon may be insufficient.

In terms of humans "owning" the economy but still having trouble getting what they want, it's not obvious this is a worse outcome than the society we have today. Indeed this feels like a pretty natural progression of human society. Humans already interact with (and not so infrequently get tricked or exploited by) entities smarter than them, such as large corporations or nation states. Yet even though I sometimes find I've bought a dud on the basis of canny marketing, overall I'm much better off living in a modern capitalist economy than in the Stone Age, where humans were more directly in control.

However, it does seem like there's a lot of value lost in the scenario where humans become increasingly disempowered, even if their lives are still better than in 2022. From a total utilitarian perspective, "slightly better than 2022" and "all humans dead" are rounding errors relative to "possible future human flourishing". But things look quite different under other ethical views, so I'm reluctant to conflate these outcomes.

Thanks for this response, I'm glad to see more public debate on this!

The part of Katja's part C that I found most compelling was the argument that for a given AI system its best interests might be to work within the system rather than aiming to seize power. Your response argues that even if this holds true for AI systems that are only slightly superhuman, eventually we will cross a threshold where a single AI system can take over. This seems true if we hold the world fixed -- there is some sufficiently capable AI system that can take over the 2022 world. But this capability threshold is a moving target: humanity will get better at aligning and controlling AI systems as we gain more experience with them, and we may be able to enlist the help of AI systems to keep others in check. So, why should we expect the equilibrium here to be an AI takeover, rather than AIs working for humans because it is in their selfish best interest in a market economy where humans are currently the primary property owners?

I think the crux here is whether we expect AI systems to by default collude with one another. They might -- they have a lot of things in common that humans don't, especially if they're copies of one another! But coordination in general is hard, especially if it has to be surreptitious.

As an analogy, I could argue that for much of human history soldiers were only slightly more capable than civilians. Sure, a trained soldier with a shield and sword is a fearsome opponent, but a small group of coordinated civilians could be victorious. Yet as we develop more sophisticated weapons such as guns, cannons, missiles, the power that a single soldier has grows greater and greater. So, by your argument, eventually a single soldier will be powerful enough to take over the world.

This isn't totally fanciful -- the Spanish conquest of the Inca Empire started with just 168 soldiers! The Spanish fought with swords, crossbows, and lances -- if the Inca Empire were still around, it seems likely that a far smaller modern military force could defeat them. Yet, clearly no single soldier is in a position to take over the world, or even a small city. Military coups d'état are the closest, but involve convincing a significant fraction of the military that it is in their interest to seize power. Of course most soldiers wish to serve their nation, not seize power, which goes some way to explaining the relatively low rate of coup attempts. But it's also notable that many coup attempts fail, or at least do not lead to a stable military dictatorship, precisely because of the difficulty of internal coordination. After all, if someone intends to destroy the current power structure and violate their promises, how much can you trust that they'll really have your back if you support them?

An interesting consequence of this is that it's ambiguous whether making AI more cooperative makes the situation better or worse.

Thanks for the quick reply! I definitely don't feel confident in the 20W number; I could believe 13W is true for more energy-efficient (small) humans, in which case I agree your claim ends up being true some of the time (but as you say, there's little wiggle room). Changing it to 1000x seems like a good solution, though, and gives you plenty of margin for error.

This is a nitpick, but I don't think this claim is quite right (emphasis added)

 If a silicon-chip AGI server were literally 10,000× the volume, 10,000× the mass, and 10,000× the power consumption of a human brain, with comparable performance, I don’t think anyone would be particularly bothered—in particular, its electricity costs would still be below my local minimum wage!!

First, how much power does the brain use? 20 watts is StackExchange's answer, but I've struggled to find good references here. The appealingly named "Appraising the brain's energy budget" gives 20% of the overall calories consumed by the body, but that raises the question of the power consumption of the human body, and whether this is at rest or under exertion, etc. Still, I don't think the 20 watts figure is more than 2x off, so let's soldier on.
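
As a rough sanity check on the 20 watt figure, here is a minimal sketch that converts a daily calorie intake into average power and takes 20% of it. The 2,000 kcal/day intake is an assumption for illustration (it is not from the paper), and the other intakes are only there to show how sensitive the answer is.

```python
# Sanity check: brain power as 20% of whole-body average power.
# ASSUMPTION: the daily intakes below are illustrative, not from the cited paper.
JOULES_PER_KCAL = 4184.0   # 1 kcal = 4184 J
SECONDS_PER_DAY = 86_400
BRAIN_FRACTION = 0.20      # share of calories attributed to the brain


def average_watts(kcal_per_day: float) -> float:
    """Average power in watts for a given daily energy intake."""
    return kcal_per_day * JOULES_PER_KCAL / SECONDS_PER_DAY


def brain_watts(kcal_per_day: float) -> float:
    """Brain power in watts, assuming a fixed fraction of body power."""
    return BRAIN_FRACTION * average_watts(kcal_per_day)


for kcal in (1500.0, 2000.0, 3000.0):
    print(f"{kcal:.0f} kcal/day -> body {average_watts(kcal):.1f} W, "
          f"brain {brain_watts(kcal):.1f} W")
```

Under these assumptions, 2,000 kcal/day works out to a little under 100 W for the body and a little under 20 W for the brain, which is consistent with the 20 watt figure being within a factor of two.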

10,000 times 20 watts is 200 kW. That's a large but not insane amount of power, though more than a single domestic power supply in the US can deliver: some larger homes might have a 200A @ 120V circuit, which is 24 kW, or 19.2 kW of permissible load under the 80% rule, so you'd need roughly ten of them. Of course you also wouldn't be able to power the HVAC needed to cool all these chips, but let's suppose you live in Alaska and can just open the windows.

At the time of writing, the cheapest US electricity prices are around $0.09 per kWh with many states (including Alaska, unfortunately) being twice that at around $0.20/kWh. But let's suppose you're in both a cool climate and have a really great deal on electricity. So your 200 kW of chips costs you just $0.09*200=$18/hour.

Federal minimum wage is $7.25/hour, and the highest I'm aware of in any US state is $15/hour. So it seems that you won't be cheaper than the brain on electricity prices if you're 10,000 times less efficient. I've systematically tried to make favorable assumptions here. Your 200kW proto-AGI probably won't be in an Alaskan garage, but in a tech company's data center with corresponding costs for HVAC, redundant power, security, etc. Colo costs vary widely depending on location and economies of scale. A recent quote I had was at around the $0.40/kWh mark -- so about 4x the cost quoted above.
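
The arithmetic above can be sketched in a few lines. The prices and wages are the ones quoted in this comment; the function names and the break-even helper are illustrative additions.

```python
# Hourly electricity cost of a machine N times less power-efficient than the brain.
BRAIN_WATTS = 20.0              # figure used above, believed good to within ~2x
PRICE_CHEAP = 0.09              # $/kWh, cheapest US retail price quoted above
PRICE_COLO = 0.40               # $/kWh, colo quote mentioned above
FEDERAL_MIN_WAGE = 7.25         # $/hour
HIGHEST_STATE_MIN_WAGE = 15.00  # $/hour


def hourly_cost(scale: float, price_per_kwh: float) -> float:
    """Dollars per hour to power a load of scale * BRAIN_WATTS."""
    kilowatts = scale * BRAIN_WATTS / 1000.0
    return kilowatts * price_per_kwh


def break_even_scale(wage: float, price_per_kwh: float) -> float:
    """Largest inefficiency factor whose electricity bill still undercuts `wage`."""
    return wage / (BRAIN_WATTS / 1000.0 * price_per_kwh)


for scale in (1_000, 10_000):
    print(f"{scale}x: ${hourly_cost(scale, PRICE_CHEAP):.2f}/h cheap, "
          f"${hourly_cost(scale, PRICE_COLO):.2f}/h colo")
print(f"break-even vs $15/h at cheap power: "
      f"{break_even_scale(HIGHEST_STATE_MIN_WAGE, PRICE_CHEAP):.0f}x")
```

At the cheapest price, the break-even against a $15/hour wage sits somewhat below 10,000x, while 1,000x leaves a wide margin even at colo prices.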

This doesn't massively change the qualitative takeaway, which is that even if something were 10,000 (or even a million) times less efficient than the brain, we'd absolutely still go ahead and build a demo anyway. But it is worth noting that something at the $60/hour range might not actually be all that transformative unless it's able to perform highly skilled labor -- at least until we make it more efficient (which would happen quite rapidly).

"The Floating Droid" example is interesting as there's a genuine ambiguity in the task specification here. In some sense that means there's no "good" behavior for a prompted imitation model here. (For an instruction-following model, we might want it to ask for clarification, but that's outside the scope of this contest.) But it's interesting that the interpretation flips with model scale, and in the opposite direction to what I'd have predicted (doing EV calculations is harder, so I'd have expected scale to increase, not decrease, EV answers). Follow-up questions I'd be excited to see the author address include:

  1. Does the problem go away if we include an example where EV and actual outcome disagree? Or do the large number of other spuriously correlated examples overwhelm that?

  2. How sensitive is this to prompt? Can we prompt it some other way that makes smaller models more likely to do actual outcome, and larger models care about EV? My guess is the training data that's similar to those prompts does end up being more about actual outcomes (perhaps this says something about the frequency of probabilistic vs non-probabilistic thinking on internet text!), and that larger language models end up capturing that. But perhaps putting the system in a different "personality" is enough to resolve this. "You are a smart, statistical assistant bot that can perform complex calculations to evaluate the outcomes of bets. Now, let's answer these questions, and think step by step."

It's not clear to me how we can encourage rigor where effective without discouraging research on areas where rigor isn't currently practical. If anyone has ideas on this, I'd be very interested.

A rough heuristic I have is that if the idea you're introducing is highly novel, it's OK to not be rigorous. Your contribution is bringing this new, potentially very promising, idea to people's attention. You're seeking feedback on how promising it really is and where people are confused, which will be helpful for then later formalizing it and studying it more rigorously.

But if you're engaging with a large existing literature and everyone seems to be confused and talking past each other (which is how I'd characterize a significant fraction of the mesa-optimization literature, for example) -- then the time has come to make things more rigorous, and you are unlikely to make much further progress without it.

Work that is still outside the academic Overton window can be brought into academia if it can be approached with the technical rigor of academia, and work that meets academic standards is much more valuable than work that doesn't; this is both because it can be picked up by the ML community, and because it's much harder to tell if you are making meaningful progress if your work doesn't meet these standards of rigor.

Strong agreement with this! I'm frequently told by people that you "cannot publish" on a certain area, but in my experience this is rarely true. Rather, you have to put more work into communicating your idea, and justifying the claims you make -- both a valuable exercise! Of course you'll have a harder time publishing than on something that people immediately understand -- but people do respect novel and interesting work, so done well I think it's much better for your career than one might naively expect.

I especially wish there was more emphasis on rigor on the Alignment Forum and elsewhere: it can be valuable to do early-stage work that's more sloppy (rigor is slow and expensive), but when there are long-standing disagreements it's usually better to start formalizing things or performing empirical work than to continue opining.

That said, I do think academia has some systemic blindspots. For one, I think CS is too dismissive of speculative and conceptual research -- much of this work will end up being mistaken admittedly, but it's an invaluable source of ideas. I also think there's too much emphasis on an "algorithmic contribution" in ML, which leads to undervaluing careful empirical evaluations and understanding failure modes of existing systems.

A related dataset is Waterbirds, described in Sagawa et al (2020), where you want to classify birds as landbirds or waterbirds regardless of whether they happen to be on a water or land background.

The main difference from HappyFaces is that in Waterbirds the correlation between bird type and background is imperfect, although strong. By contrast, HappyFaces has perfect spurious correlation on the training set. Of course you could filter Waterbirds to make the spurious correlation perfect to get an equally challenging but more natural dataset.
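
The filtering step could look something like the sketch below. The `Example` record and its field names are hypothetical stand-ins, not the actual Waterbirds data format; the idea is just to drop every training example whose bird type and background disagree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """HYPOTHETICAL record; the real Waterbirds metadata is laid out differently."""
    image_id: str
    bird: str        # "landbird" or "waterbird"
    background: str  # "land" or "water"


def is_aligned(ex: Example) -> bool:
    """True when the background matches the bird type (the spurious cue agrees)."""
    return (ex.bird == "waterbird") == (ex.background == "water")


def make_perfectly_spurious(train: list[Example]) -> list[Example]:
    """Keep only aligned examples, so background perfectly predicts the label."""
    return [ex for ex in train if is_aligned(ex)]


def agreement_rate(examples: list[Example]) -> float:
    """Fraction of examples where background and bird type agree."""
    return sum(is_aligned(ex) for ex in examples) / len(examples)


# Toy training split: four aligned examples and one conflicting one.
train = [
    Example("a", "waterbird", "water"),
    Example("b", "landbird", "land"),
    Example("c", "waterbird", "land"),   # conflicting: gets filtered out
    Example("d", "landbird", "land"),
    Example("e", "waterbird", "water"),
]
filtered = make_perfectly_spurious(train)
print(agreement_rate(train), agreement_rate(filtered))
```

Only the training split is filtered here, matching the HappyFaces setup where the correlation is perfect on the training set.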

My sense is that Stuart assuming there's an initial-specified reward function is a simplification, not a key part of the plan, and that he'd also be interested in e.g. generalizing a reward function learned from other sources of human feedback like preference comparison.

IRD would do well on this problem because it has an explicit distribution over possible reward functions, but this isn't really that unique to IRD -- Bayesian IRL or preference comparison would have the same property.

I googled "model-based RL Atari" and the first hit was this which likewise tries to learn the reward function by supervised learning from observations of past rewards (if I understand correctly)

Ah, the "model-based using a model-free RL algorithm" approach :) They learn a world model using supervised learning, and then use PPO (a model-free RL algorithm) to train a policy in it. It sounds odd but it makes sense: you hopefully get much of the sample efficiency of model-based training, while still retaining the state-of-the-art results of model-free RL. You're right that in this setup, as the actions are being chosen by the (model-free RL) policy, you don't get any zero-shot generalization.

I added a new sub-bullet at the top to clarify that it's hard to explain by RL unless you assume the planner can query the ground-truth reward function in arbitrary hypothetical states. And then I also added a new paragraph to the "other possible explanations" section at the bottom saying what I said in the paragraph just above. Thank you.

Thanks for updating the post to clarify this point -- I agree with the new wording.

In ML today, the reward function is typically a function of states and actions, not "thoughts". In a brain, the reward can depend directly on what you're imagining doing or planning to do, or even just what you're thinking about. That's my proposal here.

Yes indeed, your proposal is quite different from RL. The closest I can think of to rewards over "thoughts" in ML would be regularization terms that take into account weights or, occasionally, activations -- but that's very crude compared to what you're proposing.
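
To make the comparison concrete, here is a minimal sketch of what such regularization terms usually look like: a task loss plus an L2 penalty on weights and an L1 penalty on activations. The coefficients and function names are illustrative assumptions, not taken from any particular library or from the post.

```python
# Loss with penalties on weights and activations, in plain Python.
# ASSUMPTION: coefficients and names are illustrative only.

def l2_penalty(weights: list[float]) -> float:
    """Sum of squared weights (weight decay style penalty)."""
    return sum(w * w for w in weights)


def l1_penalty(activations: list[float]) -> float:
    """Sum of absolute activations (sparsity style penalty)."""
    return sum(abs(a) for a in activations)


def regularized_loss(task_loss: float,
                     weights: list[float],
                     activations: list[float],
                     weight_coef: float = 1e-2,
                     act_coef: float = 1e-3) -> float:
    """Task loss plus penalties that depend on the network's internals."""
    return (task_loss
            + weight_coef * l2_penalty(weights)
            + act_coef * l1_penalty(activations))


loss = regularized_loss(0.5, weights=[1.0, -2.0], activations=[0.0, 3.0, -1.0])
print(loss)
```

The penalties here depend only on simple norms of the internals, which is part of why this is so crude next to a reward that depends on the content of a thought.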
