That being said, if we can have an AGI that is only aligned to what we want now, it would already be a huge win. [...] Solving philosophy: This is a great-to-have but the implications for not solving philosophy does not seem catastrophic.

I tried to argue the opposite in the following posts. I'm curious if you've seen them and still disagree with my position.

If a coordination point is sticking, reducing it to a financial trade helps speed it up, by turning the hidden information into a willingness-to-pay / willingness-to-be-paid integer.

I don't disagree with this. I would add that if agents aren't aligned, then that introduces an additional inefficiency into this pricing process, because each agent now has an incentive to distort the price to benefit themselves, and this (together with information asymmetry) means some mutually profitable trades will not occur.

Figuring out the costs of an action in someone else’s world is detailed and costly work, and price mechanisms + incentives can communicate this information far more efficiently, and in these two situations having trust-in-honesty (and very aligned goals) does not change this fact.

Some work being "detailed and costly" isn't necessarily a big problem for HCH, since we theoretically have an infinite tree of free labor, whereas the inefficiencies introduced by agents having different values/interests seem potentially of a different character. I'm not super confident about this (and I'm overall pretty skeptical about HCH for this and other reasons), but just think that John was too confident in his position in the OP or at least hasn't explained his position enough. To restate the question I see being unanswered: why is alignment + infinite free labor still not enough to overcome the problems we see with actual human orgs?

My takeaway was that the difficulty of credit assignment is a major limiting factor

With existing human institutions, a big part of the problem has to be that every participant has an incentive to distort the credit assignment (i.e., cause more credit to be assigned to oneself). (This is what I conclude from economic theory and also fits with my experience and common sense.) It may well be the case that even if you removed this issue, credit assignment would still be a major problem for things like HCH, but how can you know this from empirical experience with real-world human institutions (which you emphasize in the OP)? If you know of some theory/math/model that says that credit assignment would be a big problem with HCH, why not talk about that instead?

E.g. for unconscious economics in particular, the selection effects mostly apply to memetics in the HCH tree. And in versions of HCH which allow repeated calls to the same human (as Paul’s later version of the proposal does IIRC), unconscious economics applies in the more traditional ways as well.

I'm still not getting a good picture of what your thinking is on this. Seems like the inferential gap is wider than you're expecting? Can you go into more details, and maybe include an example?

(1) seems mostly-irrelevant-in-practice to me; do you want to give an example or two of where it would be relevant?

My intuition around (1) being important mostly comes from studying things like industrial organization and theory of the firm. If you look at the economic theories (mostly based on game theory today) that try to explain why economies are organized the way they are, and where market inefficiencies come from, they all have a fundamental dependence on the assumption of different participants having different interests/values. In other words, if you removed that assumption from the theoretical models and replaced it with the opposite assumption, they would collapse in the sense that all or most of the inefficiencies ("transaction costs") would go away and it would become very puzzling why, for example, there are large hierarchical firms instead of everyone being independent contractors who just buy and sell their labor/services on the open market, or why monopolies are bad (i.e., cause "deadweight loss" in the economy).

I still have some uncertainty that maybe these ivory tower theories/economists are wrong, and you're actually right about (1) not being that important, but I'd need some more explanations/arguments in that direction for it to be more than a small doubt at this point.

This post seems to rely too much on transferring intuitions about existing human institutions to the new (e.g. HCH) setting, where there are two big differences that could invalidate those intuitions:

  1. Real humans all have different interests/values, even if most of them on a conscious level are trying to help.
  2. Real humans are very costly and institutions have to economize on them. (Is coordination still a scarce resource if we can cheaply add more "coordinators"?)

In this post, you don't explain in any detail why you think the intuitions should nevertheless transfer. I read some of the linked posts that might explain this, and couldn't find an explanation in them either. They seem to talk about problems in human institutions, and don't mention why the same issues might exist in new constructs such as HCH despite the differences that I mention. For example you link "those selection pressures apply regardless of the peoples’ intentions" to Unconscious Economics but it's just not obvious to me how that post applies in the case of HCH.

As a more general/tangential comment, I'm a bit confused about how "elevate hypothesis to our attention" is supposed to work. I mean it took some conscious effort to come up with a possible mechanistic story about how "inner reward optimizer" might arise, so how were we supposed to come up with such a story without paying attention to "inner reward optimizer" in the first place?

Perhaps it's not that we should literally pay no attention to "inner reward optimizer" until we have a good mechanistic story for it, but more like we are (or were) paying too much attention to it, given that we don't (didn't) yet have a good mechanistic story? (But if so, how to decide how much is too much?)

At this point, there isn’t a strong reason to elevate this “inner reward optimizer” hypothesis to our attention. The idea that AIs will get really smart and primarily optimize some reward signal… I don’t know of any good mechanistic stories for that. I’d love to hear some, if there are any.

Here's a story:

  1. Suppose we provide the reward as an explicit input to the agent (in addition to using it as antecedent-computation-reinforcer)
  2. If the agent has developed curiosity, it will think thoughts like "What is this number in my input stream?" and later "Hmm it seems correlated to my behavior in certain ways."
  3. If the agent has developed cognitive machinery for doing exploration (in the explore/exploit sense) or philosophy, at some later point it might have thoughts like "What if I explicitly tried to increase this number? Would that be a good idea or bad?"
  4. It might still answer "bad", but at this point the outer optimizer might notice (do the algorithmic equivalent of thinking the following), "If I modified this agent slightly by making it answer 'good' instead (or increasing its probability of answering 'good'), then expected future reward will be increased." In other words, there seems a fairly obvious gradient towards becoming a reward-maximizer at this point.

I don't think this is guaranteed to happen, but seems likely enough to elevate “inner reward optimizer” hypothesis to our attention, at least.

I doubt that’s the primary component that makes the difference. Other countries which did mostly sensible things early are eg Australia, Czechia, Vietnam, New Zealand, Iceland.

What do you think is the primary component? I seem to recall reading somewhere that previous experience with SARS makes a big difference. I guess my more general point is that if the good COVID responses can mostly be explained by factors that predictably won't be available to the median AI risk response, then the variance in COVID response doesn't help to give much hope for a good AI risk response.

My main claim isn’t about what a median response would be, but something like “difference between median early covid governmental response and actually good early covid response was something between 1 and 2 sigma; this suggests bad response isn’t over-determined, and sensibe responses are within human reach”.

This seems to depend on response to AI risk being of similar difficulty as response to COVID. I think people who updated towards "bad response to AI risk is overdetermined" did so partly on the basis that the former is much harder. (In other words, if the median government has done this badly against COVID, what chance does it have against something much harder?) I wrote down a list of things that make COVID an easier challenge, which I now realize may be a bit of a tangent if that's not the main thing you want to argue about, but I'll put down here anyway so as to not waste it.

  1. it's relatively intuitive for humans to think about the mechanics of the danger and possible countermeasures
  2. previous human experiences with pandemics, including very similar ones like SARS
  3. there are very effective countermeasures that are much easier / less costly than comparable countermeasures for AI risk, such as distributing high quality masks to everyone and sealing one's borders
  4. COVID isn't agenty and can't fight back intelligently
  5. potentially divisive issues in AI risk response seem to be a strict superset of politically divisive issues in COVID response (additional issues include: how to weigh very long term benefits against short term costs, the sentience, moral worth, and rights of AIs, what kind of values do we want AIs to have, and/or who should have control/access to AI)

For people who doubt this, I’d point to variance in initial governmental-level response to COVID19, which ranged from “highly incompetent” (eg. early US) to “quite competent” (eg Taiwan).

Seems worth noting that Taiwan is an outlier in terms of average IQ of its population. Given this, I find it pretty unlikely that typical governmental response to AI would be more akin to Taiwan than the US.

I think until recently, I've been consistently more pessimistic than Eliezer about AI existential safety. Here's a 2004 SL4 post for example where I tried to argue against MIRI (SIAI at the time) trying to build a safe AI (and again in 2011). I've made my own list of sources of AI risk that's somewhat similar to this list. But it seems to me that there are still various "outs" from certain doom, such that my probability of a good outcome is closer to 20% (maybe a range of 10-30% depending on my mood) than 1%.

  1. Human thought partially exposes only a partially scrutable outer surface layer. Words only trace our real thoughts. Words are not an AGI-complete data representation in its native style. The underparts of human thought are not exposed for direct imitation learning and can't be put in any dataset. This makes it hard and probably impossible to train a powerful system entirely on imitation of human words or other human-legible contents, which are only impoverished subsystems of human thoughts; unless that system is powerful enough to contain inner intelligences figuring out the humans, and at that point it is no longer really working as imitative human thought.

One of the biggest "outs" I see is that it turns out to be not that hard "to train a powerful system entirely on imitation of human words or other human-legible contents", we (e.g., a relatively responsible AI lab) train such a system and then use it to differentially accelerate AI safety research. I definitely think that it's very risky to rely on such black-box human imitations for existential safety, and that a competent civilization would be pursuing other plans where they can end up with greater certainty of success, but it seems there's something like a 20% chance that it just works out anyway.

To explain my thinking a bit more, human children have to learn how to think human thoughts through "imitation of human words or other human-legible contents". It's possible that they can only do this successfully because their genes contain certain key ingredients that enable human thinking, but it also seems possible that children are just implementations of some generic imitation learning algorithm, so our artificial learning algorithms (once they become advanced/powerful enough) won't be worse at learning to think like humans. I don't know how to rule out the latter possibility with very high confidence. Eliezer, if you do, can you please explain this more?

Load More