All of Davidmanheim's Comments + Replies

Is posting publicly on LessWrong really the best way to suggest how to do political and policy strategy, or coordination? This seems obviously suboptimal, and I'd think that you should probably ask for feedback and look into how to promote cooperation privately first.

That said, I think everything you said here is correct on an object level, and worth thinking about.

3 · Evan Hubinger · 18d
I have done this also.

Strongly agree. Here are three examples of work I've put on Arxiv that originated from the forum, which might be helpful as a touchstone. The first was cited 7 times in the first year, and 50 more times since. The latter two were posted last year, and have not yet been indexed by Google as having been cited.

As an example of a technical but fairly conceptual paper, there is the Categorizing Goodhart's law paper. I pushed for this to be a paper rather than just a post, and I think that the resulting exposure was very worthwhile. Scott wrote the original pos... (read more)

Seconding the .tex export, since it's much more useful than just getting a pdf!

That's correct. My point is that measuring goals which are not natural to measure will, in general, face many more problems with Goodharting and similar misoptimization and overoptimization pressures. Other approaches can be more productive, or at the least such cases call for careful design of metrics, rather than discovery of what to measure and how.
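To make the Goodharting concern concrete, here is a toy sketch (my own illustration, with made-up numbers) of the regressional case: the harder you select on a noisy proxy for a goal, the more of the apparent score is noise rather than the thing you cared about.

```python
# Minimal sketch of regressional Goodhart: select hard on a noisy proxy and
# the true value of the selected items regresses toward the mean.
import random

random.seed(0)
items = [random.gauss(0, 1) for _ in range(100_000)]   # true value we care about
proxy = [v + random.gauss(0, 1) for v in items]        # measured proxy = true + noise

# Compare the single best item by proxy against a milder top-1% cutoff.
best_by_proxy = max(range(len(items)), key=lambda i: proxy[i])
top_1pct = sorted(range(len(items)), key=lambda i: proxy[i], reverse=True)[:1000]

print("true value of the proxy-maximizing item:", round(items[best_by_proxy], 2))
print("mean true value of the proxy top 1%:    ",
      round(sum(items[i] for i in top_1pct) / len(top_1pct), 2))
# The harder we push on the proxy, the larger the share of the score that is
# just noise: the gap between proxy and true value grows at the extreme.
```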

I think this is going to be wrong as an approach. Weight and temperature are properties of physical systems at specific points in time, and can be measured coherently because we understand laws about those systems. Alignment could be measured as a function of a particular system at a specific point in time, once we have a clear understanding of what? All of human values? 

I'm not arguing that "alignment" specifically is the thing we should be measuring.

More generally, a useful mantra is "we do not get to choose the ontology". In this context, it means that there are certain things which are natural to measure (like temperature and weight), and we do not get to pick what they are; we have to discover what they are.

Depends on how you define the measure over jobs. If you mean "the jobs of half of all people," probably true. If you mean "half of the distinct jobs as they are classified by NAICS or similar," I think I disagree. 

Question: "effective arguments for the importance of AI safety" - is this about arguments for the importance of just technical AI safety, or more general AI safety, to include governance and similar things?

Think of it as a "practicing a dark art of rationality" post, and I'd think it would seem less off-putting.

2 · Ben Pace · 5mo
I think it would be less "off-putting" if we had common knowledge of it being such a post. I think the authors don't think of it as that from reading Sidney's comment.

Please feel free to repost this elsewhere, and/or tell people about it.

And if anyone is interested in this type of job but is currently still in school, or for other reasons unable to work full time at present, we encourage them to apply and note the circumstances, as we may be able to find other ways to support their work, or at least collaborate and provide mentorship.

I'm not sure I agree that discontinuity and prosaic alignment are compatible, though you make a reasonable case, but I do think there is compatibility between slower governance approaches and discontinuity, if it is far enough away.

In the post, I wanted to distinguish between two things you're now combining: how hard alignment is, and how long we have. And yes, combining these, we get the issue of how hard it will be to solve alignment in the time frame we have until we need to solve it. But they are conceptually distinct.

And neither of these directly relates to takeoff speed, which in the current framing is something like the time frame from when we have systems that are near-human until they hit a capability discontinuity. You said "First off, takeoff speed and timing are correlate... (read more)

2 · Samuel Dylan Martin · 6mo
Like I said in my first comment, the in-practice difficulty of alignment is obviously connected to timeline and takeoff speed. But you're right that you're talking about the intrinsic difficulty of alignment vs. takeoff speed in this post, not the in-practice difficulty. But those are also still correlated, for the reasons I gave - mainly that a discontinuity is an essential step in Eliezer-style pessimism and fast takeoff views. I'm not sure how close this correlation is. Do these views come apart in other possible worlds? I.e. could you believe in a discontinuity to a core of general intelligence but still think prosaic alignment can work? I think that potentially you can - if you think that pre-HLMI AI (pre-discontinuity) still has enough capabilities to help you do alignment research before dangerous HLMI shows up. But prosaic alignment seems to require more assumptions to be feasible assuming a discontinuity, like that the discontinuity doesn't occur before the important capabilities you need to do good alignment research are available.

Relevant to this agenda are the failure modes I discussed in my multi-agent failures paper, which seems worth looking at in this context.

I'm skeptical that many of the problems with aggregation don't both apply to actual individual human values once extrapolated, and generalize to AIs with closely related values, but I'd need to lay out the case for that more clearly. (I did discuss the difficulty of cooperation even given compatible goals a bit in this paper, but it's nowhere near complete in addressing this issue.)

3 · Stuart Armstrong · 7mo
It's worth you write up your point and post it - that tends to clarify the issue, for yourself as well as for others.

This seems fragile in ways that make me less optimistic about the approach overall. We have strong reasons to think that value aggregation is intractable, and, by analogy, in some ways the problem of coherence in CEV is the tricky part. That is, the problem of making sure that we're not Dutch book-able is, IIRC, NP-complete, and even worse, the problem of aggregating preferences has several impossibility results.

Edit: To clarify, I'm excited about the approach overall, and think it's likely to be valuable, but this part seems like a big problem.
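To make the aggregation worry concrete, here is a toy illustration (mine, not from the thread) of the simplest impossibility result: three internally consistent preference orderings whose majority aggregate is cyclic, and hence Dutch-bookable.

```python
# Condorcet's paradox: transitive individual preferences, cyclic aggregate.
from itertools import combinations

voters = [
    ["A", "B", "C"],   # voter 1: A > B > C
    ["B", "C", "A"],   # voter 2: B > C > A
    ["C", "A", "B"],   # voter 3: C > A > B
]

def majority_prefers(x, y):
    """True if a majority of voters rank x above y."""
    return sum(v.index(x) < v.index(y) for v in voters) > len(voters) / 2

for x, y in combinations("ABC", 2):
    winner, loser = (x, y) if majority_prefers(x, y) else (y, x)
    print(f"majority prefers {winner} over {loser}")
# A beats B, C beats A, B beats C: the aggregate is cyclic (and so Dutch-bookable)
# even though every individual ordering is perfectly consistent.
```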

0 · Stuart Armstrong · 7mo
I've posted [https://www.lesswrong.com/posts/hBJCMWELaW6MxinYW/intertheoretic-utility-comparison] on the theoretical difficulties of aggregating the utilities of different agents. But doing it in practice is much more feasible (scale the utilities to some not-too-unreasonable scale, add them, maximise sum). But value extrapolation is different from human value aggregation; for example, low power (or low impact) AIs can be defined with value extrapolation, and that doesn't need human value aggregation.
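As a rough sketch of the "scale the utilities, add them, maximise the sum" procedure described above (the specific min-max rescaling and the agents' numbers are my assumptions, not Armstrong's):

```python
# Practical aggregation sketch: rescale each agent's utilities, sum, maximise.
options = ["x", "y", "z"]
utilities = {
    "alice": {"x": 0.0, "y": 10.0, "z": 6.0},
    "bob":   {"x": 5.0, "y": 2.0,  "z": 4.0},
}

def normalize(u):
    """Rescale one agent's utilities to [0, 1] (a not-too-unreasonable scale)."""
    lo, hi = min(u.values()), max(u.values())
    return {k: (v - lo) / (hi - lo) for k, v in u.items()}

scaled = {agent: normalize(u) for agent, u in utilities.items()}
best = max(options, key=lambda o: sum(scaled[a][o] for a in scaled))
print("option maximising the summed, rescaled utilities:", best)
# Prints "z": the compromise option wins once both agents' scales are aligned.
```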

This post is both a huge contribution, giving a simpler and shorter explanation of a critical topic, with a far clearer context, and has been useful to point people to as an alternative to the main sequence. I wouldn't promote it as more important than the actual series, but I would suggest it as a strong alternative to including the full sequence in the 2020 Review. (Especially because I suspect that those who are very interested are likely to have read the full sequence, and most others will not even if it is included.)

Yes on point Number 1, and partly on point number 2.

If humans don't have incredibly complete models for how to achieve their goals, but know they want a glass of water, telling the AI to put a cup of H2O in front of them can create weird mistakes. This can even happen because of causal connections the humans are unaware of. The AI might have better causal models than the humans, but still cause problems for other reasons. In this case, a human might not know the difference between normal water and heavy water, but the AI might decide that since there are t... (read more)

This seems really exciting, and I'd love to chat about how betrayal is similar to or different than manipulation. Specifically, I think the framework I proposed in my earlier multi-agent failure modes paper might be helpful in thinking through the categorization. (But note that I don't endorse thinking of everything as Goodhart's law, despite that paper - though I still think it's technically true, it's not as useful as I had hoped.)

On the topic of growth rate of computing power, it's worth noting that we expect the model which experts have to be somewhat more complex than what we represented as "Moore's law through year " - but as with the simplification regarding CPU/GPU/ASIC compute, I'm unsure how much this is really a crux for anyone about the timing for AGI.

I would be very interested to hear from anyone who said, for example, "I would expect AGI by 2035 if Moore's law continues, but I expect it to end before 2030, and it will therefore likely take until 2050 to reach HLMI/AGI."

2 · Daniel_Eth · 1y
I think very few people would explicitly articulate a view like that, but I also think there are people who hold a view along the lines of, "Moore will continue strong for a number of years, and then after that compute/$ will grow at <20% as fast" – in which case, if we're bottlenecked on hardware, whether Moore ends several years earlier vs later could have a large effect on timelines.
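As a back-of-the-envelope illustration of why the end date could matter so much under a hardware bottleneck (all numbers here are mine and purely illustrative, not from the thread):

```python
# Suppose reaching HLMI needs 1000x more compute/$ than we have today, growth
# is ~41%/yr (doubling every ~2 years) while Moore's law holds, and then drops
# to "<20% as fast" afterwards.
from math import log

needed_factor = 1000
moore_rate = 0.41          # ~41%/yr while Moore's law continues
slow_rate = 0.41 * 0.2     # growth after Moore's law ends

def years_to_reach(factor, years_of_moore):
    """Years of fast growth, then slow growth, to reach `factor` more compute/$."""
    gained = (1 + moore_rate) ** years_of_moore
    if gained >= factor:
        return log(factor) / log(1 + moore_rate)
    remaining = factor / gained
    return years_of_moore + log(remaining) / log(1 + slow_rate)

for end in (5, 10, 15):    # Moore's law ends 5, 10, or 15 years from now
    print(f"Moore ends in {end:>2} yrs -> ~{years_to_reach(needed_factor, end):.0f} yrs to 1000x")
# Shifting the end date by a decade moves the arrival date by several decades
# under these assumptions, which is the point of the comment above.
```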

I mostly agree, but we get into the details of how we expect improvements can occur much more in the upcoming posts on paths to HLMI and takeoff speeds.

Note: I think that this is a better written-version of what I was discussing when I revisited selection versus control, here: https://www.lesswrong.com/posts/BEMvcaeixt3uEqyBk/what-does-optimization-mean-again-optimizing-and-goodhart (The other posts in that series seem relevant.)

I didn't think about the structure that search-in territory / model-based optimization allows, but in those posts I mention that most optimization iterates back and forth between search-in-model and search-in-territory, and that a key feature which I think you're ignoring here is cost of samples / iteration. 

Selection in humans is via mutation, so that closely related organisms can get a benefit from cooperating, even at the cost of personally not replicating. As a JBS Haldane quote puts it, "I would gladly give up my life for two brothers, or eight cousins."

Continuing from that paper, explaining it better than I could:

"What is more interesting, it is only in such small populations that natural selection would favour the spread of genes making for certain kinds of altruistic behaviour. Let us suppose that you carry a rare gene which affects your behaviour so t... (read more)

1 · Daniel Kokotajlo · 2y
Right, so... we need to make sure selection in AIs also has that property? Or is the thought that even if AIs evolve to be honest, it'll only be with other AIs and not with humans? As an aside, I'm interested to see more explanations for altruism lined up side by side and compared. I just finished reading a book that gave a memetic/cultural explanation rather than a genetic one.

My point was that deception will almost certainly outperform honesty/cooperation when AI is interacting with humans, and on reflection, seems likely to do so even when interacting with other AIs by default, because there is no group selection pressure.

4 · Daniel Kokotajlo · 2y
I think I was thinking that in multi-agent training environments there might actually be group selection pressure for honesty. (Or at least, there might be whatever selection pressures produced honesty in humans, even if that turns out to be something other than group selection.)

In the spirit of open peer review, here are a few thoughts:

First, overall, I was convinced during earlier discussions that this is a bad idea - not because of costs, but because the idea lacks real benefits, and itself will not serve the necessary functions. Also see this earlier proposal (with no comments). There are already outlets that allow robust peer review, and the field is not well served by moving away from the current CS / ML dynamic of arXiv papers and presentations at conferences, which allow for more rapid iteration and collaboration / buildin... (read more)

2 · Daniel Kokotajlo · 2y
+1 to each of these. May I suggest, instead of creating a JAA, we create a textbook? Or maybe a "special compilation" book that simply aggregates stuff? Or maybe even an encyclopedia? It's like a journal, except that it doesn't prevent these things from being published in normal academic journals as well.

Honesty, too, arose that way. So I'm not sure whether (say) a system trained to answer questions in such a way that the humans watching it give reward would be more or less likely to be deceptive.

I think it is mistaken. (Or perhaps I don't understand a key claim / assumption.)

Honesty evolved as a group dynamic, where it was beneficial for the group to have ways for individuals to honestly commit, or make lying expensive in some way. That cooperative pressure dynamic does not exist when a single agent is "evolving" on its own in an effectively static enviro... (read more)

1 · Daniel Kokotajlo · 2y
I'm confused because the stuff you wrote in the paragraph seems like an expanded version of what I think. In other words it supports what I said rather than objects to it.

Strongly agree that it's unclear that their failures would be detected. 
For discussion and examples, see my paper here: https://www.mdpi.com/2504-2289/3/2/21/htm 

Another possible argument is that we can't tell when multiple AIs are failing or subverting each other.
Agents pursuing their own goals in a multi-agent environment are intrinsically manipulative, and when agents manipulate one another, it happens in ways that we do not know how to detect or consider. This is somewhat different from when they manipulate humans, where we have a clear idea of what does and does not qualify as harmful manipulation.

re: #5, that doesn't seem to claim that we can infer U given their actions, which is what the impossibility of deducing preferences is actually claiming. That is, assuming 5, we still cannot show that there isn't some U′ ≠ U such that the observed behavior is also optimal for U′.

(And as pointed out elsewhere, it isn't Stuart's thesis, it's a well known and basic result in the decision theory / economics / philosophy literature.)
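A minimal illustration (mine, not from the thread) of the unidentifiability point: the same observed behavior is exactly optimal under several distinct utility functions, so behavior alone cannot pin down U.

```python
# The same observed action is optimal under the "true" U, a rescaled U2, and
# an indifferent U3 - so observing optimal behaviour cannot rule out U2 or U3.
actions = ["left", "right"]

U  = {"left": 0.0, "right": 1.0}            # the "true" utility
U2 = {"left": 0.0, "right": 100.0}          # a rescaled utility
U3 = {"left": 5.0, "right": 5.0}            # indifference: every policy is optimal

observed_action = max(actions, key=U.get)   # the behaviour we get to see: "right"

for name, u in [("U", U), ("U2", U2), ("U3", U3)]:
    optimal_under_u = max(u.values()) == u[observed_action]
    print(f"observed action is optimal under {name}: {optimal_under_u}")
# All three print True.
```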

1 · Vanessa Kosoy · 2y
You misunderstand the intent. We're talking about inverse reinforcement learning. The goal is not necessarily inferring the unknown U, but producing some behavior that optimizes the unknown U. Ofc if the policy you're observing is optimal then it's trivial to do so by following the same policy. But, using my approach we might be able to extend it into results like "the policy you're observing is optimal w.r.t. certain computational complexity, and your goal is to produce an optimal policy w.r.t. higher computational complexity." (Btw I think the formal statement I gave for 5 is false, but there might be an alternative version that works.) I am referring to this [http://papers.neurips.cc/paper/7803-occams-razor-is-insufficient-to-infer-the-preferences-of-irrational-agents] and related work by Armstrong.

I think there needs to be individual decisionmaking (on the part of both organizations and individual researchers, especially in light of the unilateralists' curse,) alongside a much broader discussion about how the world should handle unsafe machine learning, and more advanced AI.

I very much don't think it would be wise for the AI safety community to debate and come up with shared, semi-public guidelines for, essentially, what to withhold from the broader public without input from the wider ML / AI research community, whose members are impacted and whose work is a big part of what we are discussing. That community needs to be engaged in any such discussions.

I think a page titled "here are some tools and resources for thinking about AI-related infohazards" would be helpful and uncontroversial and feasible... That could include things like a list of trusted people in the community who have an open offer to discuss and offer feedback in confidence, and links to various articles and guidelines on the topic (without necessarily "officially" endorsing any particular approach), etc.

I agree that your proposal is well worth doing, it just sounds a lot more ambitious and long-term.

I'm not talking about guidelines for the wider AI community. I'm talking about guidelines for my own research (and presumably other alignment researchers would be interested in the same). The wider AI community doesn't share my assumptions about AI risk. In particular, I believe that most of what they're doing is actively harmful. Therefore, I don't expect them to accept these guidelines, and I'm also mostly uninterested in their input. Moreover, it's not the broader public that worries me, but precisely the broader AI community. It is from them that I wan... (read more)

There are some intermediate options available instead of just "full secret" or "full publish"... and I haven't seen anyone mention that...

OpenAI's phased release of GPT2 seems like a clear example of exactly this. And there is a forthcoming paper looking at the internal deliberations around this from Toby Shevlane, in addition to his extant work on the question of how disclosure potentially affects misuse.

The first thing I would note is that stakeholders need to be involved in making any guidelines, and that pushing for guidelines from the outside is unhelpful, if not harmful, since it pushes participants to be defensive about their work. There is also an extensive literature discussing the general issue of information dissemination hazards and the issues of regulation in other domains, such as nuclear weapons technology, biological and chemical weapons, and similar.

There is also a fair amount of ongoing work on synthesizing this literature and the implic... (read more)

2 · Vanessa Kosoy · 2y
Hmm, maybe I was unclear. When I said that "we need to have a public debate on the topic inside the community" I meant, the community of AI alignment researchers. So, not from the outside. As to the links, thank you. They do seem like potentially valuable inputs into the debate, although (from skimming) they don't seem to reach the point of proposing concrete guidelines and procedures.

Oh. Right. I should have gotten the reference, but wasn't thinking about it.

I'd focus even more (per my comment on Vaniver's response) and ask: "What parts of OpenAI are most and least valuable, and how do these relate to their strategy - and what strategy is best?"

I would reemphasize that "does OpenAI increase risks?" is a counterfactual question. That means we need to be clearer about what we are asking, as a matter of predicting what the counterfactuals are, and consider strategy options for going forward. This is a major set of questions, and increasing or decreasing risks as a single metric isn't enough to capture much of interest.

For a taste of what we'd want to consider, what about the following:

Are we asking OpenAI to pick a different, "safer" strategy?

Perhaps they should focu... (read more)

3 · Matthew "Vaniver" Graves · 2y
Also apparently Megaman is less popular than I thought so I added links to the names.
Now the perhaps harder step is trying to get traction on them

Yes, very much so. We're working on a few parts of this now, as part of a different project, but I agree that it's tricky. And there are a number of other things that seem like potentially very useful projects if others are interested in collaborations, or just some ideas / suggestions about how they could be approached.

(On the tables, unfortunately the tables were pasted in as images from another program. We should definitely see if we can get higher-resolution, even if we can't convert to text easily.)

I'm unsure that GPT3 can output, say, an IPython notebook to get the values it wants.

That would be really interesting to try...

(I really like this post, as I said to Issa elsewhere, but) I realized after discussing this earlier that I don't agree with a key part of the precise vs. imprecise model distinction.

A precise theory is one which can scale to 2+ levels of abstraction/indirection.
An imprecise theory is one which can scale to at most 1 level of abstraction/indirection.

I think this is wrong. More levels of abstraction are worse, not better. Specifically, if a model exactly describes a system on one level, any abstraction will lose predictive power. (Ignoring computationa... (read more)
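As a toy sketch of that claim (my own construction, not from the post): an exact two-bit model predicts perfectly, while a coarse-grained summary of it maps distinct states together and so cannot even predict its own next value.

```python
# An exact low-level model vs. a coarse abstraction of it that loses
# predictive power: two states with the same summary evolve differently.
def step(state):
    """Exact dynamics on two bits: (a, b) -> (b, a XOR b)."""
    a, b = state
    return (b, a ^ b)

def abstraction(state):
    """Coarse summary: just the number of 1s."""
    return sum(state)

s1, s2 = (0, 1), (1, 0)
print("same abstract state:", abstraction(s1) == abstraction(s2))                    # True
print("same next abstract state:", abstraction(step(s1)) == abstraction(step(s2)))   # False
# The abstraction cannot predict its own next value, while the exact model can.
```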

I think this is covered in my view of optimization via selection, where "direct solution" is the third option. Any one-shot optimizer is implicitly relying on an internal model completely for decision making, rather than iterating, as I explain there. I think that is compatible with the model here, but it needs to be extended slightly to cover what I was trying to say there.

I think this is great.

I would want to relate it to a few key points out which I tried to address in a few earlier posts. Principally, I discussed selection versus control, which is about the difference between what optimization does externally, and how it uses models and testing. This related strongly to your conception of an optimizing system, but focused on how much of the optimization process occurs in the system versus in the agent itself. This is principally important because of how it relates to misalignment and Goodharting of various types.

I had ho... (read more)

Note to add: We did formalize this more, and it has been available on Arxiv for quite a while.

Two points about how I think about this that differ significantly. (I just read up on Bolker and Jeffrey, as I was previously unfamiliar.) I had been thinking about writing this up more fully, but have been busy. (I.e. if people think it's worthwhile, tell me and I will be more likely to do so.)

First, utility is only ever computed over models of reality, not over reality itself, because it is a part of the decision making process, not directly about any self-monitoring or feedback process. It is never really evaluated against reality, nor does it need to... (read more)

This seems related to my speculations about multi-agent alignment. In short, for embedded agents, having a tractable complexity of building models of other decision processes either requires a reflexively consistent view of their reactions to modeling my reactions to their reactions, etc. - or it requires simplification that clearly precludes ideal Bayesian agents. I made the argument much less formally, and haven't followed the math in the post above (I hope to have time to go through more slowly at some point.)

To lay it out here, the basic argument ... (read more)

This is a fantastic set of definitions, and it is definitely useful. That said, I want to add something to what you said near the end. I think the penultimate point needs further elaboration. I've spoken about "multi-agent Goodhart" in other contexts, and discussed why I think it's a fundamentally hard problem, but I don't think I've really clarified how I think this relates to alignment and takeoff. I'll try to do that below.

Essentially, I think that the question of multipolarity versus individual or collective takeoff ... (read more)

I think that more engagement in this area is useful, and mostly agree. I'll point out that I think much of the issue with powerful agents and missed consequences is more usefully captured by work on Goodhart's law, which is definitely my pet idea, but seems relevant. I'll self promote shamelessly here.

Technical-ish paper with Scott Garrabrant: https://arxiv.org/abs/1803.04585

A more qualitative argument about multi-agent cases, with some examples of how it's already failing: https://www.mdpi.com/2504-2289/3/2/21/htm

A hopefully someda... (read more)

See my other reply about pseudo-pareto improvements - but I think the "understood + endorsed" idea is really important, and worth further thought.

My current best understanding is that if we assume people have arbitrary inconsistencies, it will be impossible to do better than satisfice on different human values by creating near-pareto improvements for intra-human values. But inconsistent values don't even allow pareto-improvements! Any change makes things incomparable. Given that, I think we do need a super-prospect theory that explains in a systematic way what humans do "wrong", so that we can pick what an AI should respect of human preferences, and what can be ignored.

For instance, I ... (read more)
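A small sketch of the point about inconsistent values (my framing, not the original post's): decompose one person's cyclic preferences into internally consistent sub-value orderings, and no switch between options is a pareto-improvement across them.

```python
# With a cyclic aggregate preference, every change is worse by some component
# of the person's own values, so nothing counts as a pareto-improvement.
sub_values = [
    ["A", "B", "C"],   # sub-value 1 ranks A > B > C
    ["B", "C", "A"],   # sub-value 2 ranks B > C > A
    ["C", "A", "B"],   # sub-value 3 ranks C > A > B
]

def pareto_improvement(old, new):
    """True if no sub-value ranks `old` strictly above `new`."""
    return all(v.index(new) <= v.index(old) for v in sub_values)

options = ["A", "B", "C"]
moves = [(o, n) for o in options for n in options if o != n]
print("pareto-improving moves:", [m for m in moves if pareto_improvement(*m)])
# Prints an empty list: every possible change hurts at least one sub-value.
```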


"Arguably, you can't fully align with inconsistent preferences"
My intuitions tend to agree, but I'm also inclined to ask "why not?" e.g. even if my preferences are absurdly cyclical, but we get AGI to imitate me perfectly (or me + faster thinking + more information), under what sense of the word is it "unaligned" with me? More generally, what is it about these other coherence conditions that prevent meaningful "alignment"? (Maybe it takes a big discursive can of worms, but I actually haven't seen this
... (read more)
2 · Abram Demski · 3y
Yeah, I think something like this is pretty important. Another reason is that humans inherently don't like to be told, top-down, that X is the optimal solution. A utilitarian AI might redistribute property forcefully, where a pareto-improving AI would seek to compensate people. An even more stringent requirement which seems potentially sensible: only pareto-improvements which both parties both understand and endorse. (IE, there should be something like consent.) This seems very sensible with small numbers of people, but unfortunately, seems infeasible for large numbers of people (given the way all actions have side-effects for many many people).

I don't think you're putting enough weight on what REALLY convinced economists, which was the tractability that assuming utility provides, and their enduring physics envy. (But to be fair, who wouldn't wish that their domain was as tractable as Newtonian physics ended up being?)

But yes, Utility is a useful enough first approximation for humans that it's worth using as a starting point. But only as a starting point. Unfortunately, too many economists are instead busy building castles on their assumptions, without trying to work with bett... (read more)

Yeah, I don't 100% buy the arguments which I gave in bullet-points in my previous comment.

But I guess I would say the following:

I expect to basically not buy any descriptive theory of human preferences. It doesn't seem likely we could find super-prospect theory which really successfully codified the sort of inconsistencies which we see in human values, and then reap some benefits for AI alignment.

So it seems like what you want to do instead is make very few assumptions at all. Assume that the human can do things like answer questions, but don't... (read more)

Glad to see engagement on this - and I should probably respond to some of these points, but before doing so, want to point to where I've already done work on this, since much of that work either admits your points, or addresses them.

First, I think you should read the paper I wrote with Scott that extended the thoughts from his post. It certainly doesn't address all of this, but we were very clear that adversarial Goodhart was less clear than the other modes and needed further work. We also more clearly drew the connection to tails fall apart, and... (read more)

This post has significantly changed my mental model of how to understand key challenges in AI safety, and also given me a clearer understanding of, and language for describing, why complex game-theoretic challenges are poorly specified or understood. The terms and concepts in this series of posts have become a key part of my basic intellectual toolkit.

I don't think this is straightforward in practice - and putting a cartesian boundary in place is avoiding exactly the key problem. Any feature of the world used as the item to minimize/maximize is measured, and building uncorruptible measurement systems seems like a non-trivial problem. For instance, how do I get my GAI to maximize blue in an area instead of maximizing the blue input into its sensor when pointed at that area? We need to essentially solve value loading and understand a bunch of embedded agent issues to really talk about this.
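As a toy sketch of the blue-maximizing example (names and numbers are mine): if the reward is defined on the measurement rather than on the world, tampering with the sensor dominates actually painting the area.

```python
# Reward defined on the sensor reading prefers tampering over painting.
world = {"true_blue": 0.2, "sensor_bias": 0.0}

def act(world, action):
    w = dict(world)
    if action == "paint_area":
        w["true_blue"] = min(1.0, w["true_blue"] + 0.5)   # actually make the area bluer
    elif action == "tamper_with_sensor":
        w["sensor_bias"] = 10.0                           # point the sensor at a blue screen
    return w

def measured_blue(w):
    return w["true_blue"] + w["sensor_bias"]              # what the reward actually sees

for action in ("paint_area", "tamper_with_sensor"):
    w = act(world, action)
    print(f"{action:>18}: measured={measured_blue(w):5.1f}  true={w['true_blue']:.1f}")
# The measured objective prefers tampering (10.2 vs 0.7) even though true
# blueness is unchanged - the boundary between world and sensor is exactly
# where the problem lives.
```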
