All of Grue_Slinky's Comments + Replies

Nitpick: "transfer learning" is the standard term, no? It has a Wiki page and seems to get a more coherent batch of search results than googling "robustness to data shift".

2Rohin Shah3y
It goes under many names, such as transfer learning, robustness to distributional shift / data shift, and out-of-distribution generalization. Each one has (to me) slightly different connotations, e.g. transfer learning suggests that the researcher has a clear idea of the distinction between the first and second setting (and so you "transfer" from the first to the second), whereas if in RL you change which part of the state space you're in as you act, I would be more likely to call that distributional shift rather than transfer learning.

Whoops, mea culpa on that one! Deleted and changed to:

the main post there pointed out that seemingly anything can be trivially modeled as being a "utility maximizer" (further discussion here), whereas only some intelligent agents can be described as being "goal-directed" (as defined in this post), and the latter is a more useful concept for reasoning about AI safety.

[copying from my comment on the EA Forum x-post]

For reference, some other lists of AI safety problems that can be tackled by non-AI people:

Luke Muehlhauser's big (but somewhat old) list: "How to study superintelligence strategy"

AI Impacts has made several lists of research problems

Wei Dai's, "Problems in AI Alignment that philosophers could potentially contribute to"

Kaj Sotala's case for the relevance of psychology/cog sci to AI safety (I would add that Ought is currently testing the feasibility of IDA/Debate by doing psy... (read more)

1Evan Hubinger3y
Also relevant is Geoffrey Irving and Amanda Askell's "AI Safety Needs Social Scientists [https://distill.pub/2019/safety-needs-social-scientists/]."

*begins drafting longer proposal*

Yeah, this is definitely more high-risk, high-reward than the others, and the fact that there are potentially some very substantial spillover effects if successful makes me both excited and nervous about the concept. I'm thinking of Arbital as an example of "trying to solve way too many problems at once", so I want to manage expectations and just try to make some exercises that inspire people to think about the art of mathematizing certain fuzzy philosophical concepts. (Running title is "Formalization Exercises", but I'm not sure if there's a better pithy name that captures it).

In any case, I appreciate the feedback, Mr. Entworth.

2johnswentworth3y
Oh no, not you too. It was bad enough with just Bena.

(8)

In light of the “Fixed Points” critique, a set of exercises that seem more useful/reflective of MIRI’s research than those exercises. What I have in mind is taking some of the classic success stories of formalized philosophy (e.g. Turing machines, Kolmogorov complexity, Shannon information, Pearlian causality, etc., but this could also be done for reflective oracles and logical induction), introducing the problems they were meant to solve, and giving some stepping stones that guide one to have the intuitions and thoughts that (presu... (read more)

I think this would be an extremely useful exercise for multiple independent reasons:

  • it's directly attempting to teach skills which I do not currently know any reproducible way to teach/learn
  • it involves looking at how breakthroughs happened historically, which is an independently useful meta-strategy
  • it directly involves investigating the intuitions behind foundational ideas relevant to the theory of agency, and could easily expose alternative views/interpretations which are more useful (in some contexts) than the usual presentations

(7)

A critique of MIRI’s “Fixed Points” paradigm, expanding on some points I made on MIRIxDiscord a while ago (which would take a full post to properly articulate). Main issue is, I'm unsure if it's still guiding anyone's research and/or who outside MIRI would care.

(6)

An analysis of what kinds of differential progress we can expect from stronger ML. Actually, I don’t feel like writing this post, but I just don’t understand why Dai and Christiano, respectively, are particularly concerned about differential progress on the polynomial hierarchy and what’s easy-to-measure vs. hard-to-measure. My gut reaction is “maybe, but why privilege that axis of differential progress of all things”, and I can’t resolve that in my mind without doing a comprehensive analysis of potential “... (read more)

3Rohin Shah3y
Re: easy-to-measure vs. hard-to-measure axis: That seems like the most obvious axis on which AI is likely to be different from humans, and it clearly does lead to bad outcomes?

(5)

A skeptical take on Part I of “What failure looks like” (3 objections, to summarize briefly: not much evidence so far, not much precedent historically, and “why this, of all the possible axes of differential progress?”) [Unsure if these objections will stand up if written out more fully]

(4)

A post discussing my confusions about Goodhart and Garrabrant’s taxonomy of it. I find myself not completely satisfied with it:

1) “adversarial” seems too broad to be that useful as a category

2) It doesn’t clarify what phenomenon is meant by “Goodhart”; in particular, “regressional” doesn’t feel like something the original law was talking about, and any natural definition of “Goodhart” that includes it seems really broad

3) Whereas “regressional” and “extrema... (read more)

(3)

“When and why should we be worried about robustness to distributional shift?”: When reading that section of Concrete Problems, there’s a temptation to just say “this isn’t relevant long-term, since an AGI by definition would have solved that problem”. But adversarial examples and the human safety problems (to the extent we worry about them) both say that in some circumstances we don’t expect this to be solved by default. I’d like to think more about when the naïve “AGI will be smart... (read more)

3Rohin Shah3y
Concerns about mesa-optimizers are mostly concerns that "capabilities" will be robust to distributional shift while "objectives" will not be robust.

(2)

[I probably need a better term for this] “Wide-open-source game theory”: Where other agents can not only simulate you, but also figure out "why" you made a given decision. There’s a Standard Objection to this: it’s unfair to compare algorithms in environments where they are judged not only by their actions, but on arbitrary features of their code; to which I say, this isn’t an arbitrary feature. I was thinking about this in the context of how, even if an AGI makes the right decision, we care “why”... (read more)

(1)

A classification of some of the vulnerabilities/issues we might expect AGIs to face because they are potentially open-source, and generally more “transparent” to potential adversaries. For instance, they could face adversarial examples, open-source game theory problems, Dutch books, or weird threats that humans don’t have to deal with. Also, there’s a spectrum from “extreme black box” to “extreme white box” with quite a few plausible milestones along the way, that makes for a certain transparency hierarchy, and it may be helpful to analyze this (or at least take a stab at formulating it).

Upvote this comment (and downvote the others as appropriate) if most of the other ideas don’t seem that fruitful.

By default, I’d mostly take this as a signal of “my time would be better spent working on someone else’s agenda or existing problems that people have posed”, but I suppose other alternatives exist; if so, comment below.

I have a bunch of half-baked ideas, most of which are mediocre in expectation and probably not worth investing my time and others’ attention writing up. Some of them probably are decent, but I’m not sure which ones, and the user base is probably as good as any for feedback.

So I’m just going to post them all as replies to this comment. Upvote if they seem promising, downvote if not. Comments encouraged. I reserve the “right” to maintain my inside view, but I wouldn’t make this poll if I didn’t put substantial weight on this community’s opinions.

1Grue_Slinky3y
(7) A critique of MIRI’s “Fixed Points [https://www.alignmentforum.org/s/5WF3wmwvxX9TEbFXf]” paradigm, expanding on some points I made on MIRIxDiscord a while ago (which would take a full post to properly articulate). Main issue is, I'm unsure if it's still guiding anyone's research and/or who outside MIRI would care.
6Grue_Slinky3y
(6) An analysis of what kinds of differential progress we can expect from stronger ML. Actually, I don’t feel like writing this post, but I just don’t understand why Dai and Christiano, respectively, are particularly concerned about differential progress on the polynomial hierarchy [https://www.alignmentforum.org/posts/vbtvgNXkufFRSrx4j/three-ai-safety-related-ideas] and what’s easy-to-measure vs. hard-to-measure [https://www.alignmentforum.org/posts/HBxe6wdjxK239zajf/what-failure-looks-like]. My gut reaction is “maybe, but why privilege that axis of differential progress of all things”, and I can’t resolve that in my mind without doing a comprehensive analysis of potential “differential progresses” that ML could precipitate. Which, argh, sounds like an exhausting task, but someone should do it?

5Grue_Slinky3y
(4) A post discussing my confusions about Goodhart and Garrabrant’s taxonomy [https://www.alignmentforum.org/posts/EbFABnst8LsidYs5Y/goodhart-taxonomy] of it. I find myself not completely satisfied with it:

1) “adversarial” seems too broad to be that useful as a category

2) It doesn’t clarify what phenomenon is meant by “Goodhart”; in particular, “regressional” doesn’t feel like something the original law was talking about, and any natural definition of “Goodhart” that includes it seems really broad

3) Whereas “regressional” and “extremal” (and perhaps “causal”) are defined statistically, “adversarial” is defined in terms of agents, and this may have downsides (I’m less sure about this objection)

But I’m also not sure how I’d reclassify it and that task seems hard. Which partially updates me in favor of the Taxonomy being good, but at the very least I feel there’s more to say about it.
1Grue_Slinky3y
(3) “When and why should we be worried about robustness to distributional shift?”: When reading that section of Concrete Problems [https://arxiv.org/abs/1606.06565], there’s a temptation to just say “this isn’t relevant long-term, since an AGI by definition would have solved that problem”. But adversarial examples [https://openai.com/blog/adversarial-example-research/] and the human safety problems [https://www.alignmentforum.org/posts/vbtvgNXkufFRSrx4j/three-ai-safety-related-ideas] (to the extent we worry about them) both say that in some circumstances we don’t expect this to be solved by default. I’d like to think more about when the naïve “AGI will be smart” intuition applies and when it breaks.
1Grue_Slinky3y
(2) [I probably need a better term for this] “Wide-open-source game theory [https://acritch.com/osgt-is-weird/]”: Where other agents can not only simulate you, but also figure out "why" you made a given decision. There’s a Standard Objection [https://www.lesswrong.com/posts/wk7WmrN4FeNmyepXm/an-introduction-to-newcomblike-problems] to this: it’s unfair to compare algorithms in environments where they are judged not only by their actions, but on arbitrary features of their code; to which I say, this isn’t an arbitrary feature. I was thinking about this in the context [https://www.alignmentforum.org/posts/BKjJJH2cRpJcAnP7T/thoughts-on-human-models] of how, even if an AGI makes the right decision, we care “why” it did so (i.e. because it’s optimizing for what we want vs. optimizing for human approval for instrumental reasons). I doubt we’ll formalize this “why” anytime soon (see e.g. section 5 of this [https://arxiv.org/pdf/1401.5577.pdf]), but I think semi-formal things can be said about it upon some effort. [I thought of this independently from (1), but I think every level of the “transparency hierarchy” could have its own kind of game theory, much like the “open-source” level clearly does]
1Grue_Slinky3y
(1) A classification of some of the vulnerabilities/issues we might expect AGIs to face because they are potentially open-source, and generally more “transparent” to potential adversaries. For instance, they could face adversarial examples, open-source game theory problems, Dutch books, or weird threats that humans don’t have to deal with. Also, there’s a spectrum from “extreme black box” to “extreme white box” with quite a few plausible milestones along the way, that makes for a certain transparency hierarchy, and it may be helpful to analyze this (or at least take a stab at formulating it).
1Grue_Slinky3y
Upvote this comment (and downvote the others as appropriate) if most of the other ideas don’t seem that fruitful. By default, I’d mostly take this as a signal of “my time would be better spent working on someone else’s agenda or existing problems that people have posed”, but I suppose other alternatives exist; if so, comment below.

This question also has a negative answer, as witnessed by the example of an ant colony --- agent-like behavior without agent-like architecture, produced by a "non-agenty" optimization process of evolution. Nonetheless, a general version of the question remains: If some X exhibits agent-like behavior, does it follow that there exists some interesting physical structure causally upstream of X?

Neat example! But for my part, I'm confused about this last sentence, even after reading the footnote:

An example of such "interesting physical struc... (read more)
1Vojtech Kovarik3y
First off, while I feel somewhat de-confused about X-like behavior, I don't feel very confident about X-like architectures. Maybe the meaning is somewhat clear on higher levels of abstraction (e.g., if my brain goes "realize I want to describe a concept --> visualize several explanations and judge each for suitability --> pick the one that seems the best --> send a signal to start typing it down", then this would be a kind of search/optimization-thingy). But on the level of physics, I don't really know what an architecture means. So take this with a grain of salt.

Maybe the term "physical structure" is misleading. The thing I was trying to point at is the distinction between being able to accurately model Y using model X, and Y actually being X. In the sense that there might be a giant look-up table (GLUT) that accurately predicts your behavior, but on no level of abstraction is it correct to say that you actually are a GLUT. Whereas modelling you as having some goals, planning, etc. might be less accurate but somewhat more, hm, true. I realize this isn't very precise, but I guess you can see what I mean.

That being said, I suppose that what I meant by "optimization architecture" is, for example, a stochastic gradient descent with the emphasis on "this is the input", "this is the part of the algorithm that does the calculation", and "this is the output". An "implementation of an optimization architecture" would be...well, the atoms of your computer that perform SGD, or maybe some simple bacteria that moves in the direction where the concentration of whatever-it-likes is the highest (not that anything I know would implement precisely SGD, but still).

Ad "interesting physical structure" behind the ant-colony: If by "evolution" we mean the atoms that the world is made of, as they changed over time until your ant colony emerged...then yeah, this is a physical structure causally upstream of the ant colony, and one that is responsible for the ant colony behaving the wa

For reference, LeCun discussed his atheoretic/experimentalist views in more depth in this FB debate with Ali Rahimi and also this lecture. But maybe we should distinguish some distinct axes of the experimentalist/theorist divide in DL:

1) Experimentalism/theorism is a more appropriate paradigm for thinking about AI safety

2) Experimentalism/theorism is a more appropriate paradigm for making progress in AI capabilities

Where the LeCun/Russell debate is about (1) and LeCun/Rahimi is about (2). And maybe this is oversimplifying things, since "theorism"... (read more)

3orthonormal3y
Good comment. I disagree with this bit: And then it would probably have been seen as outmoded and thrown away completely when AI capabilities research progressed into realms that vastly surpassed GOFAI. I don't know that there's an easy way to get capabilities researchers to think seriously about safety concerns that haven't manifested on a sufficient scale yet.

That all seems pretty fair.

If a system is trying to align with idealized reflectively-endorsed values (similar to CEV), then one might expect such values to be coherent.

That's why I distinguished between the hypotheses of "human utility" and CEV. It is my vague understanding (and I could be wrong) that some alignment researchers see it as their task to align AGI with current humans and their values, thinking the "extrapolation" less important or that it will take care of itself, while others consider extrapolation an important part... (read more)

2Abram Demski3y
I didn't reply to this originally, probably because I think it's all pretty reasonable. My thinking on this is pretty open. In some sense, everything is extrapolation (you don't exactly "currently" have preferences, because every process is expressed through time...). But OTOH there may be a strong argument for doing as little extrapolation as possible. Well, imitating you is not quite right. (EG, the now-classic example introduced with the CIRL framework: you want the AI to help you make coffee, not learn to drink coffee itself.) Of course maybe it is imitating you at some level in its decision-making, like, imitating your way of judging what's good. I'm thinking things like: will it disobey requests which it understands and is capable of? Will it fight you? Not to say that those things are universally wrong to do, but they could be types of alignment we're shooting for, and inconsistencies do seem to create trouble there. Presumably if we know that it might fight us, we would want to have some kind of firm statement about what kind of "better" reasoning would make it do so (e.g., it might temporarily fight us if we were severely deluded in some way, but we want pretty high standards for that).
1David Manheim3y
I've been thinking about whether you can have AGI that only aims for pareto-improvements, or a weaker formulation of that, in order to align with inconsistent values among groups of people. This is strongly based on Eric Drexler's thoughts on what he has called "pareto-topia". (I haven't gotten anywhere thinking about this because I'm spending my time on other things.)

Is this open thread not going to be a monthly thing?

FWIW I liked reading the comment threads here, and would be inclined to participate in the future. But that's just my opinion. I'm curious if more senior people had reasons for not liking the idea?

2Rohin Shah3y
I expected that it would be better for me to polish ideas before posting on the forum, and treated this as an experiment to check. I think it broadly confirmed my original view, so I'm not very likely to post top-level comments on open threads in the future, and I told the admins so. I don't know what their decision process was after that. (Possibly they expected that future open threads would be much quieter, since the two biggest comment threads here were both started by my top-level comments.)

Huh, that's a good point. Whereas it seems probably inevitable that AI research would've eventually converged on something similar to the current D(R)L paradigm, we can imagine a lot of different ways AI safety could have looked instead right now. Which makes sense, since the latter is still young and in a kind of pre-paradigmatic philosophical stage, with little unambiguous feedback to dictate how things should unfold (and it's far from clear when substantially more of this feedback will show up).

I can imagine an alternate timeline whe... (read more)

Yes, perhaps I should've been more clear. Learning certain distance functions is a practical solution to some things, so maybe the phrase "distance functions are hard" is too simplistic. What I meant to say is more like

Fully-specified distance functions are hard, over and above the difficulty of formally specifying most things, and it's often hard to notice this difficulty

This is mostly applicable to Agent Foundations-like research, where we are trying to give a formal model of (some aspect of) how agents work. Sometimes, we can reduce... (read more)

2John Maxwell3y
Well, a classifier that is 100% accurate would also do the job ;) (I'm not sure a 100% accurate classifier is feasible per se, but a classifier which can be made arbitrarily accurate given enough data/compute/life-long learning experience seems potentially feasible.)

Also, small perturbations aren't necessarily the only way to construct adversarial examples. Suppose I want to attack a model M1, which I have access to, and I also have a more accurate model M2. Then I could execute an automated search for cases where M1 and M2 disagree. (Maybe I use gradient descent on the input space, maximizing an objective function corresponding to the level of disagreement between M1 and M2.) Then I hire people on Mechanical Turk to look through the disagreements and flag the ones where M1 is wrong. (Since M2 is more accurate, M1 will "usually" be wrong.)

This is actually one way to look at what's going on with traditional small perturbation adversarial examples. M1 is a deep learning model and M2 is a 1-nearest-neighbor model--not very good in general, but quite accurate in the immediate region of data points with known labels. The problem is that deep learning models don't have a very strong inductive bias towards mapping nearby inputs to nearby outputs (sometimes called "Lipschitzness [http://mathworld.wolfram.com/LipschitzFunction.html]"). L2 regularization actually makes deep learning models more Lipschitz because smaller coefficients=smaller eigenvalues for weight matrices=less capacity to stretch nearby inputs away from each other in output space. I think maybe that's part of why L2 regularization works.

Hoping to expand the previous two paragraphs into a paper with Matthew Barnett before too long--if anyone wants to help us get it published, please send me a PM (neither of us has ever published a paper before).
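
For anyone who wants to play with the M1/M2 disagreement search described above, here is a minimal toy sketch of the automated part only. Everything in it is an assumption made for illustration: `m1` and `m2` are made-up one-dimensional logistic models rather than real classifiers, the disagreement objective is taken to be the squared difference of their outputs, and the gradient is estimated by finite differences. The human review step (flagging which disagreements are actual M1 errors) is not modeled.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical stand-ins: m1 is the model under attack, m2 the "more accurate" one.
def m1(x):
    return sigmoid(2.0 * x - 1.0)


def m2(x):
    return sigmoid(0.5 * x + 0.5)


def disagreement(x):
    # Assumed objective: squared difference between the two models' outputs.
    return (m1(x) - m2(x)) ** 2


def numeric_grad(f, x, eps=1e-5):
    # Central finite-difference estimate of df/dx.
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


def find_disagreement(x0, lr=0.5, steps=200, lo=-5.0, hi=5.0):
    # Gradient ascent on the input, clipped to an assumed valid input range.
    x = x0
    for _ in range(steps):
        x += lr * numeric_grad(disagreement, x)
        x = min(hi, max(lo, x))
    return x


x_start = 0.0
x_star = find_disagreement(x_start)
# Candidate to hand to a human reviewer, with both models' outputs.
candidate = (x_star, m1(x_star), m2(x_star))
```

In a real setting the inputs would be high-dimensional and the gradient would come from automatic differentiation, but the loop has the same shape: start from a known input, climb the disagreement objective, and pass the resulting candidates on for review.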