Just someone who wants to learn about the world.
I change my views often. Anything I wrote that's more than 10 days old should be treated as potentially outdated.
I also see how you might have a catastrophe-avoiding agent capable of large positive impacts, assuming an ontology but without assuming a lot about human preferences.
I find this interesting but I'd be surprised if it were true :). I look forward to seeing it in the upcoming posts.
That said, I want to draw your attention to my definition of catastrophe, which I think is different than the way most people use the term. I think most broadly, you might think of a catastrophe as something that we would never want to happen even once. But for inner alignment, this isn't always helpful, since sometimes we want our systems to crash into the ground rather than intelligently optimizing against us, even if we never want them to crash into the ground even once. And as a starting point, we should try to mitigate these malicious failures much more than the benign ones, even if a benign failure would have a large value-neutral impact.
A closely related notion to my definition is the term "unacceptable behavior" as Paul Christiano has used it. This is the way he has defined it,
In different contexts, different behavior might be acceptable and it’s up to the user of these techniques to decide. For example, a self-driving car trainer might specify: Crashing your car is tragic but acceptable. Deliberately covering up the fact that you crashed is unacceptable.
It seems like if we want to come up with a way to avoid these types of behavior, we simply must use some dependence on human values. I can't see how to consistently separate acceptable failures from non-acceptable ones except by inferring our values.
Ahh. To be honest, I read that, but then responded to something different. I assumed you were just expressing general pessimism, since there's no guarantee that we would converge on good values upon a long reflection (and you recently viscerally realized that values are very arbitrary).
Now I see that your worry is more narrow, in that the cultural revolution might happen during this period, and would act unwisely to create the AGI during its wake. I guess this seems quite plausible, and is an important concern, though I personally am skeptical that anything like the long reflection will ever happen.
I tend to be fairly skeptical of these challenges—HCH is just a bunch of humans after all and if you can instruct them not to do things like instantiate arbitrary Turing machines, then I think a bunch of humans put together has a strong case for being aligned.
Minor nitpick: I mostly agree, but I feel like a lot of work is being done by saying that they can't instantiate arbitrary Turing machines, and that it's just a bunch of humans. Human society is also a bunch of humans, but frequently does things that I can't imagine any single intelligent person deciding. If your model breaks down for relatively human-human combinations, I think there is a significant risk that true HCH would be dangerous in quite unpredictable ways.
It sounds like you think that something like another Communist Revolution or Cultural Revolution could happen (that emphasizes some random virtues at the expense of others), but the effect would be temporary and after it's over, longer term trends will reassert themselves. Does that seem fair?
That's pretty fair.
I think it's likely that another cultural revolution could happen, and this could adversely affect the future if it happens simultaneously with a transition into an AI based economy. However, the deviations from long-term trends are very hard to predict, as you point out, and we should know about the specifics more as we get further along. In the absence of concrete details, I find it far more helpful to use information from long-term trends rather than worrying about specific scenarios.
I could be wrong here, but the stuff you mentioned as counterexamples to my model appear either ephemeral, or too particular. The "last few years" of political correctness is hardly enough time to judge world-trends by, right? By contrast, the stuff I mentioned (end of slavery, explicit policies against racism and war) seem likely to stick and stay with us for decades, if not centuries.
We can explain this after the fact by saying that the Left is being forced by impersonal social dynamics, e.g., runaway virtue signaling, to over-correct, but did anyone predict this ahead of time?
When I listen to old recordings of right wing talk show hosts from decades ago, they seem to be saying the same stuff that current people are saying today, about political correctness and being forced out of academia for saying things that are deemed harmful by the social elite, or about the Left being obsessed by equality and identity. So I would definitely say that a lot of people predicted this would happen.
The main difference is that it's now been amplified as recent political events have increased polarization, the people with older values are dying of old age or losing their power, and we have social media that makes us more aware of what is happening. But in hindsight I think this scenario isn't that surprising.
Russia and China adopted communism even though they were extremely poor
Of course, you can point to a few examples of where my model fails. I'm talking about the general trends rather than the specific cases. If we think in terms of world history, I would say that Russia in the early 20th century was "rich" in the sense that it was much richer than countries in previous centuries and this enabled it to implement communism in the first place. Government power waxes and wanes, but over time I think its power has definitely gone up as the world has gotten richer, and I think this could have been predicted.
Part of why I'm skeptical of these concerns is that it seems like a lot of moral behavior is predictable as society gets richer, and we can model the social dynamics to predict some outcomes will be good.
As evidence for the predictability, consider that rich societies are more open to LGBT rights, they have explicit policies against racism, against war, slavery, torture, and it seems like rich societies are moving in the direction of government control over many aspects of life, such as education and healthcare. Is this just a quirk of our timeline, or a natural feature of civilizations of humans as they get richer?
I am inclined to think much of it is the latter.
That's not to say that I think the current path we're going on is a good one. I just think it's more predictable than what you seem to think. Given its predictability, I feel somewhat confident in the following statements: eventually, when aging is cured, people will adopt policies that give people the choice to die. Eventually, when artificial meat is very cheap and tasty, people will ban animal-based meat.
I'm not predicting these outcomes because I am confusing what I hope for and what I think will happen. I just genuinely think that human virtue signaling dynamics will be favorable to those outcomes.
I'm less confident, leaning pessimistic about these questions: I don't think humans will inevitably care about wild animal suffering. I don't think humans will inevitably create a post-human utopia where people can modify their minds into any sort of blissful existence they imagine, and I don't think humans will inevitably care about subroutine suffering. It's these questions that make me uneasy about the future.
Sure, we can talk about this over video. Check your Facebook messages.
Computing the fastest route to Paris doesn't involve search?
More generally, I think in order for it to work your example can't contain subroutines that perform search over actions. Nor can it contain subroutines such that, when called in the order that the agent typically calls them, they collectively constitute a search over actions.
My example uses search, but the search is not the search of the inner alignment failure. It is merely a subroutine that is called upon by this outer superstructure, which itself is the part that is misaligned. Therefore, I fail to see why my point doesn't follow.
If your position is that inner alignment failures must only occur when internal searches are misaligned with the reward function used during training, then my example would be a counterexample to your claim, since the reason for misalignment was not due to a search being misaligned (except under some unnatural rationalization of the agent, which is a source of disagreement highlighted in the post, and in my discussion with Evan above).
If one's interpretation of the 'objective' of the agent is full of piecewise statements and ad-hoc cases, then what exactly are we doing it by describing it as maximizing an objective in the first place? You might as well describe a calculator by saying that it's maximizing the probability of outputting the following [write out the source code that leads to its outputs]. At some point the model breaks down, and the idea that it is following an objective is completely epiphenomenal to its actual operation. The model that it is maximizing an objective doesn't shed light on its internal operations any more than just spelling out exactly what its source code is.
I feel like what you're describing here is just optimization where the objective is determined by a switch statement
Typically when we imagine objectives, we think of a score which rates how well an agent performed some goal in the world. How exactly does the switch statement 'determine' the objective?
Let's say that a human is given the instructions, "If you see the coin flip heads, then become a doctor. If you see the coin flip tails, then become a lawyer." what 'objective function' is it maximizing here? If it's maximizing some weird objective function like, "probability of becoming a doctor in worlds where the coin flips heads, and probability of becoming a lawyer in worlds where the coin flips tails" this would seem to be unnatural, no? Why not simply describe it as a switch case agent instead?
Remember, this matters because we want to be perfectly clear about what types of transparency schemes work. A transparency scheme that assumes that the agent has a well-defined objective that it is using a search to optimize for, would, I think, would fail in the examples I gave. This becomes especially true if the if-statements are complicated nested structures, and repeat as part of some even more complicated loop, which seems likely.
ETA: Basically, you can always rationalize an objective function for any agent that you are given. But the question is simply, what's the best model of our agent, in the sense of being able to mitigate failures. I think most people would not categorize the lunar lander as a search-based agent, even though you could say that it is under some interpretation. The same is true with humans, plants, animals.