I'm familiar with these claims, and (I believe) the principal supporting arguments that have been made publicly. I think I understand them reasonably well.
I don't find them decisive. Some aren't even particularly convincing. A few points:
- EY sets up a false dichotomy between "train in safe regimes" and "train in dangerous regimes". In the approaches under discussion there is an ongoing effort (e.g. involving some form(s) of training) to align the system, and the proposal is to keep this effort ahead of advances in capability (in some sense).
- The first 2 claims for why corrigibility wouldn't generalize seem to prove too much -- why would intelligence generalize to "qualitatively new thought processes, things being way out of training distribution", but corrigibility would not?
- I think the last claim -- that corrigibility is "anti-natural" -- is more compelling. However, we don't actually understand the space of possible utility functions and agent designs well enough for it to be that strong. We know that any behavior is compatible with some utility function, so I would interpret Eliezer's claim as relating to the complexity or description length of utility functions that encode corrigible behavior. Work on incentives suggests that removing manipulation incentives might add very little complexity to the description of the utility function, for an AI system that already understands the world well. Humans also seem to find it simple enough to add the "without manipulation" qualifier to an objective.
I guess actually the goal is just to get something aligned enough to do a pivotal act. I don't see though why an approach that tries to maintain a relatively-sufficient level of alignment (relative to current capabilities) as capabilities scale couldn't work for that.
Different views about the fundamental difficulty of inner alignment seem to be a (the?) major driver of differences in views about how likely AI X risk is overall.
I strongly disagree with inner alignment being the correct crux. It does seem to be true that this is in fact a crux for many people, but I think this is a mistake. It is certainly significant. But I think optimism about outer alignment and global coordination ("Catch-22 vs. Saving Private Ryan") is a much bigger factor, and optimists are badly wrong on both points here.
I'm torn because I mostly agree with Eliezer that things don't look good, and most technical approaches don't seem very promising. But the attitude of unmitigated doomyness seems counter-productive.
And there are obviously things worth doing and working on, and people getting on with it.
It seems like Eliezer is implicitly focused on finding an "ultimate solution" to alignment that we can be highly confident solves the problem regardless of how things play out. But this is not where the expected utility is. The expected utility is mostly in buying time and increasing the probability of success in situations where we are not highly confident that we've solved the problem, but we get lucky.
Ideally we won't end up rolling the dice on unprincipled alignment approaches. But we probably will. So let's try and load the dice. But let's also remember that that's what we're doing.
Better, but I still think "myopia" is basically misleading here. I would go back to the drawing board *shrug*.
It seems a bit weird to me to call this myopia, since (IIUC) the AI is still planning for future impact (just not on other agents).
I think the contradiction may only be apparent, but I thought it was worth mentioning anyways. My point was just that we might actually want certifications to say things about specific algorithms.
Second, we can match the certification to the types of people and institutions, that is, our certifications talk about the executives, citizens, or corporations (rather than e.g. specific algorithms, that may be replaced in the future). Third, the certification system can build in mechanisms for updating the certification criteria periodically.
* I think effective certification is likely to involve expert analysis (including non-technical domain experts) of specific algorithms used in specific contexts. This appears to contradict the "Second" point above somewhat.
* I want people to work on developing the infrastructure for such analyses. This is in keeping with the "Third" point.
* This will likely involve a massive increase in investment of AI talent in the process of certification.
As an example, I think "manipulative" algorithms -- that treat humans as part of the state to be optimized over -- should be banned in many applications in the near future, and that we need expert involvement to determine the propensity of different algorithms to actually optimize over humans in various contexts.
You can try to partner with industry, and/or advocate for big government $$$.
I am generally more optimistic about toy problems than most people, I think, even for things like Debate.
Also, scaling laws can probably help here.
um sorta modulo a type error... risk is risk. It doesn't mean the thing has happened (we need to start using some sort of phrase like "x-event" or something for that, I think).