Q: "Why is that not enough?"
A: Because they are not being funded to produce the right kinds of outputs.
My point is not specific to machine learning. I'm not as familiar with other academic communities, but I think most of the time it would probably be worth engaging with them if there is somewhere your work could fit.
Speaking for myself…
I think I do a lot of “engaging with neuroscientists” despite not publishing peer-reviewed neuroscience papers:
In my experience people also often know their blog posts aren't very good.
My point (see footnote) is that motivations are complex. I do not believe "the real motivations" is a very useful concept here.
The question becomes why "don't they judge those costs to be worth it"? Is there motivated reasoning involved? Almost certainly yes; there always is.
works for me too now
The link is broken, FYI
Yeah this was super unclear to me; I think it's worth updating the OP.
FYI: my understanding is that "data poisoning" refers to deliberately corrupting the training data of somebody else's model, which I understand is not what you are describing.
Oh I see. I was getting at the "it's not aligned" bit.
Basically, it seems like if I become a cyborg without understanding what I'm doing, the result is either:
Only the first one seems likely to be sufficiently aligned.
I don't understand the fuss about this; I suspect these phenomena are due to uninteresting, and perhaps even well-understood, effects. A colleague of mine had this to say:
...
- After a skim, it looks to me like an instance of hubness: https://www.jmlr.org/papers/volume11/radovanovic10a/radovanovic10a.pdf
- This effect can be a little non-intuitive. There is an old paper in music retrieval where the authors battled to understand why Joni Mitchell's (classic) "Don Juan’s Reckless Daughter" was retrieved confusingly frequently (the same effect) https://d1wqtxts1x
Indeed. I think having a clean, well-understood interface for human/AI interaction seems useful here. I recognize this is a big ask given the current norms and rules around AI development and deployment.
I don't understand what you're getting at RE "personal level".
I think the most fundamental objection to becoming cyborgs is that we don't know how to say whether a person retains control over the cyborg they become a part of.
FWIW, I didn't mean to kick off a historical debate, which seems like probably not a very valuable use of y'all's time.
Unfortunately, I think even "catastrophic risk" has a high potential to be watered down and be applied to situations where dozens as opposed to millions/billions die. Even existential risk has this potential, actually, but I think it's a safer bet.
I don't think we should try and come up with a special term for (1).
The best term might be "AI engineering". The only thing it needs to be distinguished from is "AI science".
I think ML people overwhelmingly identify as doing one of those 2 things, and find it annoying and ridiculous when people in this community act like we are the only ones who care about building systems that work as intended.
I say it is a rebrand of the "AI (x-)safety" community.
When AI alignment came along, we were calling it "AI safety", even though what everyone in the community really meant all along was basically AI existential safety. "AI safety" was (IMO) a somewhat successful bid for more mainstream acceptance, which then led to dilution and confusion, necessitating a new term.
I don't think the history is that important; what's important is having good terminology going forward.
This is also why I stress that I work on AI existential safety.
So I think people shou...
Hmm... this is a good point.
I think structural risk is often a better description of reality, but I can see a rhetorical argument against framing things that way. One problem I see with doing that is that I think it leads people to think the solution is just for AI developers to be more careful, rather than observing that there will be structural incentives (etc.) pushing for less caution.
I do think that there’s a pretty solid dichotomy between (A) “the AGI does things specifically intended by its designers” and (B) “the AGI does things that the designers never wanted it to do”.
1) I don't think this dichotomy is as solid as it seems once you start poking at it... e.g. in your war example, it would be odd to say that the designers of the AGI systems that wiped out humans intended for that outcome to occur. Intentions are perhaps best thought of as incomplete specifications.
2) From our current position, I think “never ever create...
While defining accident as “incident that was not specifically intended & desired by the people who pressed ‘run’ on the AGI code” is extremely broad, it still supposes that there is such a thing as "the AGI code", which significantly restricts the space of possible risks.
There are other reasons I would not be happy with that browser extension. There is not one specific conversation I can point to; it comes up regularly. I think this replacement would probably lead to a lot of confusion, since I think when people use the word "accide...
I think the construction gives us $C(\pi) \leq C(U) + e$ for a small constant $e$ (representing the wrapper). It seems like any compression you can apply to the reward function can be translated to the policy via the wrapper. So then you would never have $C(\pi) >> C(U)$. What am I missing/misunderstanding?
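(To spell out the step I have in mind, in my own notation, which may not be exactly the intended reading: write $U_\pi$ for the wrapper reward built from $\pi$. A fixed, constant-size program maps a description of $\pi$ to one of $U_\pi$, giving $C(U_\pi) \leq C(\pi) + e$; and conversely, at any $\pi$-consistent history, a fixed program can read off $\pi$'s next action from $U_\pi$ by checking which action is rewarded, which is the sense in which I read $C(\pi) \leq C(U_\pi) + e$.)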
By "intend" do you mean that they sought that outcome / selected for it?
Or merely that it was a known or predictable outcome of their behavior?
I think "unintentional" would already probably be a better term in most cases.
Apologies, I haven't taken the time to understand all of this yet, but I have a basic question you might have an answer to...
We know how to map (deterministic) policies to reward functions using the construction at the bottom of page 6 of the reward modelling agenda (https://arxiv.org/abs/1811.07871v1): the agent is rewarded only if it has so far done exactly what the policy would do. I think of this as a wrapper function (https://en.wikipedia.org/wiki/Wrapper_function).
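A minimal sketch of the construction as I understand it (my notation, not necessarily theirs): for a deterministic policy $\pi$ and a history $h_t = (o_1, a_1, \ldots, o_t, a_t)$, define $R_\pi(h_t) = 1$ if $a_i = \pi(o_1, a_1, \ldots, o_i)$ for all $i \leq t$, and $R_\pi(h_t) = 0$ otherwise. Reward is earned only by acting exactly as $\pi$ would have acted at every step so far, so $\pi$ is an optimal policy for $R_\pi$.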
It seems like this means that, for any policy, we can represent it as optimizing re...
"Concrete Problems in AI Safety" used this distinction to make this point, and I think it was likely a useful simplification in that context. I generally think spelling it out is better, and I think people will pattern match your concerns onto the “the sci-fi scenario where AI spontaneously becomes conscious, goes rogue, and pursues its own goal” or "boring old robustness problems" if you don't invoke structural risk. I think structural risk plays a crucial role in the arguments, and even if you think things that look more like pure accidents a...
I agree somewhat; however, I think we need to be careful to distinguish "do unsavory things" from "cause human extinction", and should generally be squarely focused on the latter. The former easily becomes too political, making coordination harder.
Yes it may be useful in some very limited contexts. I can't recall a time I have seen it in writing and felt like it was not a counter-productive framing.
AI is highly non-analogous with guns.
I think "inadequate equilibrium" is too specific and insider-jargon-y.
I really don't think the distinction is meaningful or useful in almost any situation. I think if people want to make something like this distinction they should just be more clear about exactly what they are talking about.
How about the distinction between (A) “An AGI kills every human, and the people who turned on the AGI didn’t want that to happen” versus (B) “An AGI kills every human, and the people who turned on the AGI did want that to happen”?
I’m guessing that you’re going to say “That’s not a useful distinction because (B) is stupid. Obviously nobody is talking about (B)”. In which case, my response is “The things that are obvious to you and me are not necessarily obvious to people who are new to thinking carefully about AGI x-risk.”
…And in particular, normal people s...
This is a great post. Thanks for writing it! I think Figure 1 is quite compelling and thought provoking.
I began writing a response, and then realized a lot of what I wanted to say has already been said by others, so I just noted where that was the case. I'll focus on points of disagreement.
Summary: I think the basic argument of the post is well summarized in Figure 1, and by Vanessa Kosoy’s comment.
A high-level counter-argument I didn't see others making:
This is a great post. Thanks for writing it!
I agree with a lot of the counter-arguments others have mentioned.
Summary:
This post tacitly endorses the "accident vs. misuse" dichotomy.
Every time this appears, I feel compelled to mention that I think it is a terrible framing.
I believe the large majority of AI x-risk is best understood as "structural" in nature: https://forum.effectivealtruism.org/posts/oqveRcMwRMDk6SYXM/clarifications-about-structural-risk-from-ai
I understand your point of view and think it is reasonable.
However, I don't think "don't build bigger models" and "don't train models to do complicated things" need to be at odds with each other. I see the argument you are making, but I think success on these two asks is likely highly correlated, via the underlying causal factor of humanity being concerned enough about AI x-risk and coordinated enough to ensure responsible AI development.
I also think the training procedure matters a lot (and you seem to be suggesting otherwise?), since if you don't do RL...
(A very quick response):
Agree with (1) and (2).
I am ambivalent RE (3) and the replaceability arguments.
RE (4): I largely agree, but I think the norm should be "let's try to do less ambitious stuff properly" rather than "let's try to do the most ambitious stuff we can, and then try and figure out how to do it as safely as possible as a secondary objective".
it will be better to build such systems from the ground up, rather than taking existing unsafe systems and tweaking them to be safe
I would say "it may be better, and people should seriously consider this" not "it is better".
I don't really buy this argument.
Hmm, I feel a bit damned by faint praise here... it seems like, more than just type-checking, you are agreeing substantively with my points (or at least, I fail to find any substantive disagreement with/in your response).
Perhaps the main disagreement is about the definition of interpretability, where it seems like the goalposts are moving... you say (paraphrasing) "interpretability is a necessary step to robustly/correctly grounding symbols". I can interpret that in a few ways:
It could work if you can use interpretability to effectively prohibit this from happening before it is too late. Otherwise, it doesn't seem like it would work.
Actually I really don't think it does... the argument there is that:
This is only tangentially related to the point I'm making in my post, because:
I think that it implicitly uses the term 'mechanistic interpretability' differently than people typically do.
I disagree. I think in practice people say mechanistic interpretability all the time and almost never say these other more specific things. This feels a bit like moving the goalposts to me. And I already said in the caveats that it could be useful even if the most ambitious version doesn't pan out.
...For instance, we don't need to understand how models predict whether the next token is ' is' or ' was' in order to be able to ga
we do not understand the principles of mechanics, thermodynamics or chemistry required to build an engine.
If this is true, then it makes (mechanistic) interpretability much harder as well, as we'll need our interpretability tools to somehow teach us these underlying principles, as you go on to say. I don't think this is the primary stated motivation for mechanistic interpretability. The main stated motivations seem to be roughly "We can figure out if the model is doing bad (e.g. deceptive) stuff and then do one or more of: 1) not deploy it, 2) not build systems that way, 3) train against our operationalization of deception"
I agree it's a spectrum. I would put it this way:
Regarding difficulty of (1) vs. (2), OTMH, there may ...
I'm very unconvinced by the results in the IOI paper.
Responding in order:
1) Yeah, I wasn't saying it's what her post is about. But I think you can get to more interesting, cruxy stuff by interpreting it that way.
2) yep it's just a caveat I mentioned for completeness.
3) Your spontaneous reasoning doesn't say that we/it get(s) good enough at getting it to output things humans approve of before it kills us. Also, I think we're already at "we can't tell if the model is aligned or not", but this won't stop deployment. I think the default situation isn't that we can tell things are going wrong but people won't be careful enough even given that... so maybe it's just a difference of perspective or something. Hmm...
TBC, "more concerned" doesn't mean I'm not concerned about the other ones... and I just noticed that I make this mistake all the time when reading people say they are more concerned about present-day issues than x-risk....... hmmm........
This is a great response to a great post!
I mostly agree with the points made here, so I'll just highlight differences in my views.
Briefly, I think Katja's post provides good arguments for (1) "things will go fine given slow take-off", but this post interprets it as arguing for (2) "things will go fine given AI never becomes dangerously capable". I don't think the arguments here do quite enough to refute claim (1), although I'm not sure they are meant to, given the scope ("we are not discussing").
A few other notable differences:
"Sample the face at the point of highest probability density in the generative model's latent space". For GANs and diffusion models (the models we in fact generate faces with), you can do exactly this by setting the Gaussian latents to zeros, and you will see that the result is a perfectly normal, non-Eldritch human face.
(sort of nitpicking):
I think it makes more sense to look for the highest density in pixel space; this requires integrating over all settings of the latents (unless your generator is invertible, in which case you can just use change o...
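(Sketching the densities I have in mind, with $z \sim \mathcal{N}(0, I)$ the latent prior and $g$ the generator: the latent density $p_Z(z) \propto \exp(-\lVert z \rVert^2 / 2)$ is indeed maximized at $z = 0$, but the pixel-space density is the marginal $p_X(x) = \int p(x \mid z)\, p_Z(z)\, dz$, which for a deterministic invertible $g$ reduces to the change-of-variables formula $p_X(x) = p_Z(g^{-1}(x))\, \lvert \det J_{g^{-1}}(x) \rvert$; because of the Jacobian term, the two maximizers generally differ.)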
Author here -- yes, we got this comment from reviewers in the most recent round as well. ADS is a bit more general than performative prediction, since it applies outside of prediction contexts. Still very closely related.
On the other hand, the point of our work is something that people in the performative prediction community seem to only slowly be approaching, which is the incentive for ADS. Work on CIDs is much more related in that sense.
As a historical note: we started working on this in March or April 2018; performative prediction was on arXiv in Feb 2020, whereas ours was at a safety workshop in mid-2019, but not on arXiv until Sept 2020.
I'm not necessarily saying people are subconsciously trying to create a moat.
I'm saying they are acting in a way that creates a moat, that this enables them to avoid competition, and that more competition would create more motivation for them to write things up for academic audiences (or even just to write more clearly for non-academic audiences).