I see, thanks. I feel like the closest analogy here that seems viable to me would be to something like: is Open Philanthropy able to hire security experts to improve its security and assess whether they're improving its security? And I think the answer to that is yes. (Most of its grantees aren't doing work where security is very important.)
It feels harder to draw an analogy for something like "helping with standards enforcement," but maybe we could consider OP's ability to assess whether its farm animal welfare grantees are having an impact on who adheres to what standards, and how strong adherence is? I think OP has pretty good (not perfect) ability to do so.
(Chiming in late, sorry!)
I think #3 and #4 are issues, but can be compensated for if aligned AIs outnumber or outclass misaligned AIs by enough. The situation seems fairly analogous to how things are with humans - law-abiding people face a lot of extra constraints, but are still collectively more powerful.
I think #1 is a risk, but it seems <<50% likely to be decisive, especially when considering (a) the possibility for things like space travel, hardened refuges, intense medical interventions, digital people, etc. that could become viable with aligned...
I think I find the "grokking general-purpose search" argument weaker than you do, but it's not clear by how much.
The "we" in "we can point AIs toward and have some ability to assess" meant humans, not Open Phil. You might be arguing for some analogy but it's not immediately clear to me what, so maybe clarify if that's the case?
I don't agree with this characterization, at least for myself. I think people should be doing object-level alignment research now, partly (maybe mostly?) to be in better position to automate it later. I expect alignment researchers to be central to automation attempts.
It seems to me like the basic equation is something like: "If today's alignment researchers would be able to succeed given a lot more time, then they also are reasonably likely to succeed given access to a lot of human-level-ish AIs." There are reasons this could fail (perhaps future alignmen...
It seems like we could simply try to be as vigilant elsewhere as we would be without this measure, and then we could reasonably expect this measure to be net-beneficial (*how* net beneficial is debatable).
I now think I wrote that part poorly. The idea isn't so much that we say to an AI, "Go out and do whatever you need to do - accumulate money, hire analysts, run experiments, etc. - and come back with a plan that we will evaluate."
The idea is more like this:
(Sorry for the long delay here!) The post articulates a number of specific ways in which some AIs can help to supervise others (e.g., patching security holes, generating inputs for adversarial training, finding scary inputs/training processes for threat assessment), and these don't seem to rely on the idea that an AI can automatically fully understand the internals/arguments/motivations/situation of a sufficiently close-in-capabilities other AI. The claim is not that a single supervisory arrangement of that type wipes out all risks, but that enough investm...
(Chiming in late here, sorry!) I think this is a totally valid concern, but I think it's generally helpful to discuss technical and political challenges separately. I think pessimistic folks often say things like "We have no idea how to align an AI," and I see this post as a partial counterpoint to that.
In addition to a small alignment tax (as you mention), a couple other ways I could see the political side going well would be (a) an AI project using a few-month lead to do huge amounts of further helpful work (https://www.lesswrong.com/posts/jwhcXmigv2LTrbBiB/success-without-dignity-a-nearcasting-story-of-avoiding#The_deployment_problem); (b) a standards-and-monitoring regime blocking less cautious training and deployment.
(Chiming in late here, sorry!)
It seems to me like the main crux here is that you're picturing a "phase transition" that kicks in in a fairly unpredictable way, such that a pretty small increase in e.g. inference compute or training compute could lead to a big leap in capabilities. Does that sound right?
I don't think this is implausible but haven't seen a particular reason to consider it likely.
I agree that "checks and balances" between potentially misaligned AIs are tricky and not something we should feel confident in, due to the possibility of sandbagging...
I think Nate and I would agree that this would be safe. But it seems much less realistic in the near term than something along the lines of what I outlined. A lot of the concern is that you can't really get to something equivalent to your proposal using techniques that resembles today's machine learning.
Fixed, thanks!
With apologies for the belated response: I think greghb makes a lot of good points here, and I agree with him on most of the specific disagreements with Daniel. In particular:
I don't think I am following the argument here. You seem focused on the comparison with evolution, which is only a minor part of Bio Anchors, and used primarily as an upper bound. (You say "the number is so vastly large (and actually unknown due to the 'level of details' problem) that it's not really relevant for timelines calculations," but actually Bio Anchors still estimates that the evolution anchor implies a ~50% chance of transformative AI this century.)
Generally, I don't see "A and B are very different" as a knockdown counterargument to "If A requir...
Noting that I commented on this (and other topics) here: https://www.cold-takes.com/replies-to-ancient-tweets/#20220331
The Bio Anchors report is intended as a tool for making debates about AI timelines more concrete, for those who find some bio-anchor-related bound helpful (e.g., some think we should lower bound P(AGI) at some reasonably high number for any year in which we expect to hit a particular kind of "biological anchor"). Ajeya's work lengthened my own timelines, because it helped me understand that some bio-anchor-inspired arguments for shorter timelines didn't have as much going for them as I'd thought; but I think it may have shortened some other folks'.
(The pre...
I agree with this. I often default to acting as though we have ~10-15 years, partly because I think leverage is especially high conditional on timelines in that rough range.
I'm not sure why this isn't a very general counterexample. Once we've decided that the human imitator is simpler and faster to compute, don't all further approaches (e.g., penalizing inconsistency) involve a competitiveness hit along these general lines? Aren't they basically designed to drag the AI away from a fast, simple human imitator toward a slow, complex reporter? If so, why is that better than dragging the AI from a foreign ontology toward a familiar ontology?
Can you explain this: "In Section: specificity we suggested penalizing reporters if they are consistent with many different reporters, which effectively allows us to use consistency to compress the predictor given the reporter." What does it mean to "use consistency to compress the predictor given the reporter" and how does this connect to penalizing reporters if they are consistent with many different predictors?
Here are a couple of hand-wavy "stub" proposals that I sent over to ARC, which they thought were broadly intended to be addressed by existing counterexamples. I'm posting them here so they can respond and clarify why these don't qualify.
*Proposal 1: force ontological compatibility*
On page 34 of the ELK gdoc, the authors talk about the possibility that training an AI hard enough produces a model that has deep mismatches with human ontology - that is, it has a distinct "vocabulary of basic concepts" (or nodes in a Bayes net) that are distinct from the ones h...
Again trying to answer this one despite not feeling fully solid. I'm not sure about the second proposal and might come back to it, but here's my response to the first proposal (force ontological compatibility):
The counterexample "Gradient descent is more efficient than science" should cover this proposal because it implies that the proposal is uncompetitive. Basically, the best Bayes net for making predictions could just turn out to be the super incomprehensible one found by unrestricted gradient descent, so if you force ontological compatibility then you ...
Regarding this:
...The bad reporter needs to specify the entire human model, how to do inference, and how to extract observations. But the complexity of this task depends only on the complexity of the human’s Bayes net.
If the predictor's Bayes net is fairly small, then this may be much more complex than specifying the direct translator. But if we make the predictor's Bayes net very large, then the direct translator can become more complicated — and there is no obvious upper bound on how complicated it could become. Eventually direct translation will be more co
(Note: I read an earlier draft of this report and had a lot of clarifying questions, which are addressed in the public version. I'm continuing that process here.)
I get the impression that you see most of the "builder" moves as helpful (on net, in expectation), even if there are possible worlds where they are unhelpful or harmful. For example, the "How we'd approach ELK in practice" section talks about combining several of the regularizers proposed by the "builder." It also seems like you believe that combining multiple regularizers would create a "stacking...
I hear you on this concern, but it basically seems similar (IMO) to a concern like: "The future of humanity after N more generations will be ~without value, due to all the reflection humans will do - and all the ways their values will change - between now and then." A large set of "ems" gaining control of the future after a lot of "reflection" seems like quite comparable to future humans having control over the future (also after a lot of effective "reflection").
I think there's some validity to worrying about a future with very different values from today'... (read more)