All of johnswentworth's Comments + Replies

Testing The Natural Abstraction Hypothesis: Project Intro

On the role of values: values clearly do play some role in determining which abstractions we use. An alien who observes Earth but does not care about anything on Earth's surface will likely not have a concept of trees, any more than an alien which has not observed Earth at all. Indifference has a similar effect to lack of data.

However, I expect that the space of abstractions is (approximately) discrete. A mind may use the tree-concept, or not use the tree-concept, but there is no natural abstraction arbitrarily-close-to-tree-but-not-the-same-as-tree. There... (read more)

Testing The Natural Abstraction Hypothesis: Project Intro

Also interested in helping on this - if there's modelling you'd want to outsource.

Here's one fairly-standalone project which I probably won't get to soon. It would be a fair bit of work, but also potentially very impressive in terms of both showing off technical skills and producing cool results.

Short somewhat-oversimplified version: take a finite-element model of some realistic objects. Backpropagate to compute the jacobian of final state variables with respect to initial state variables. Take a singular value decomposition of the jacobian. Hypothesis: th... (read more)
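A minimal toy sketch of that computation (the "simulator" here is a crude stand-in, not a real finite-element model, and all names are illustrative):

```python
# Toy sketch: roll a simple dynamical system forward, compute the Jacobian of the
# final state with respect to the initial state via autodiff, then inspect the
# singular value spectrum of that Jacobian.
import jax
import jax.numpy as jnp

def step(x, dt=0.01, k=1.0, damping=0.1):
    """One step of a toy damped spring lattice; a stand-in for a finite-element step."""
    pos, vel = jnp.split(x, 2)
    # nearest-neighbor spring forces plus damping (very crude physics)
    force = k * (jnp.roll(pos, 1) - 2 * pos + jnp.roll(pos, -1)) - damping * vel
    return jnp.concatenate([pos + dt * vel, vel + dt * force])

def final_state(x0, n_steps=100):
    x = x0
    for _ in range(n_steps):
        x = step(x)
    return x

x0 = jax.random.normal(jax.random.PRNGKey(0), (2 * 64,))  # 64 positions + 64 velocities
J = jax.jacobian(final_state)(x0)                   # d(final state) / d(initial state)
singular_values = jnp.linalg.svd(J, compute_uv=False)
print(singular_values[:10])                         # inspect the spectrum
```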

Testing The Natural Abstraction Hypothesis: Project Intro

Re: dual use, I do have some thoughts on exactly what sort of capabilities would potentially come out of this.

The really interesting possibility is that we end up able to precisely specify high-level human concepts - a real-life language of the birds. The specifications would correctly capture what-we-actually-mean, so they wouldn't be prone to goodhart. That would mean, for instance, being able to formally specify "strawberry on a plate" in a non-goodhartable way, so an AI optimizing for a strawberry on a plate would actually produce a strawberry on a plate... (read more)

How do we prepare for final crunch time?

Re: picking up new tools, skills and practice designing and building user interfaces, especially to complex or not-very-transparent systems, would be very-high-leverage if the tool-adoption step is rate-limiting.

How do we prepare for final crunch time?

Relevant topic of a future post: some of the ideas from Risks From Learned Optimization or the Improved Good Regulator Theorem offer insights into building effective institutions and developing flexible problem-solving capacity.

Rough intuitive idea: intelligence/agency are about generalizable problem-solving capability. How do you incentivize generalizable problem-solving capability? Ask the system to solve a wide variety of problems, or a problem general enough to encompass a wide variety.

If you want an organization to act agenty, then a useful technique ... (read more)

Transparency Trichotomy

This post seems to me to be beating around the bush. There are several different classes of transparency methods evaluated by several different proxy criteria, but this is all sort of tangential to the thing which actually matters: we do not understand what "understanding a model" means, at least not in a sufficiently-robust-and-legible way to confidently put optimization pressure on it.

For transparency via inspection, the problem is that we don't know what kind of "understanding" is required to rule out bad behavior. We can notice that some low-level featur... (read more)

5 · Mark Xu · 14d: I agree it's sort of the same problem under the hood, but I think knowing how you're going to go from "understanding understanding" to producing an understandable model controls what type of understanding you're looking for. I also agree that this post makes ~0 progress on solving the "hard problem" of transparency, I just think it provides a potentially useful framing and creates a reference for me/others to link to in the future.
The Fusion Power Generator Scenario

One way in which the analogy breaks down: in the lever case, we have two levers right next to each other, and each does something we want - it's just easy to confuse the levers. A better analogy for AI might be: many levers and switches and dials have to be set to get the behavior we want, and mistakes in some of them matter while others don't, and we don't know which ones matter when. And sometimes people will figure out that a particular combination extends the flaps, so they'll say "do this to extend the flaps", except that when some other switch has th... (read more)

Alignment By Default

Yup, this is basically where that probability came from. It still feels about right.

Alignment By Default

This is a great explanation. I basically agree, and this is exactly why I expect alignment-by-default to most likely fail even conditional on the natural abstractions hypothesis holding up.

3 · Chris_Leong · 21d: Also, I have another strange idea that might increase the probability of this working. If you could temporarily remove proxies based on what people say, then this would seem to greatly increase the chance of it hitting the actual embedded representation of human values. Maybe identifying these proxies is easier than identifying the representation of "true human values"? I don't think it's likely to work, but thought I'd share anyway.
1 · Chris_Leong · 21d: Thanks! Is this why you put the probability as "10-20% chance of alignment by this path, assuming that the unsupervised system does end up with a simple embedding of human values"? Or have you updated your probabilities since writing this post?
HCH Speculation Post #2A

This is the best explanation I have seen yet of what seem to me to be the main problems with HCH. In particular, that scene from HPMOR is one that I've also thought about as a good analogue for HCH problems. (Though I do think the "humans are bad at things" issue is probably more important for HCH than the malicious memes problem; HCH is basically a giant bureaucracy, and the same shortcomings which make humans bad at giant bureaucracies will directly limit HCH.)

Behavioral Sufficient Statistics for Goal-Directedness

I'm working on writing it up properly, should have a post at some point.

EDIT: it's up.

Behavioral Sufficient Statistics for Goal-Directedness

I still feel like you're missing something important here.

For instance... in the explainability factor, you measure "the average deviation of π from the actions favored by the action-value function Q of the goal", using a particular formula. But why this particular formula? Why not take the log of Q first, or use a different quantity in the denominator? Indeed, there's a strong argument to be made this formula is a bad choice: the value function Q is... (read more)

4 · Adam Shimi · 1mo: To people reading this thread: we had a private conversation with John (faster and easier), which resulted in me agreeing with you. The summary is that you can see the arguments made and constraints invoked as a set of equations, such that the adequate formalization is a solution of this set. But if the set has more than one solution (maybe a lot), then it's misleading to call that the solution. So I've been working these last few days at arguing for the properties (generalization, explainability, efficiency) in such a way that the corresponding set of equations only has one solution.
The case for aligning narrowly superhuman models

We can't mostly-win just by fine-tuning a language model to do moral discourse.

Uh... yeah, I agree with that statement, but I don't really see how it's relevant. If we tune a language model to do moral discourse, then won't it be tuned to talk about things like Terry Schiavo, which we just said was not that central? Presumably tuning a language model to talk about those sorts of questions would not make it any good at moral problems like "they said they want fusion power, but they probably also want it to not be turn-into-bomb-able".

Or are you using "moral... (read more)

Behavioral Sufficient Statistics for Goal-Directedness

I think you are very confused about the conceptual significance of a "sufficient statistic".

Let's start with the prototypical setup of a sufficient statistic. Suppose I have a bunch of IID variables X_1, ..., X_N drawn from a maximum-entropy distribution with features f(X) (i.e. the "true" distribution is maxentropic subject to a constraint on the expectation of f), BUT I don't know the parameters of the distribution (i.e. I don't know the expected value E[f(X)]). For instance, maybe I know that the variables are drawn from a normal... (read more)
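To make "sufficient statistic" concrete, here's the standard Gaussian case spelled out (a minimal worked example with N IID samples and unknown mean and variance):

$$p(x_1,\dots,x_N \mid \mu,\sigma) \;\propto\; \exp\!\left(\frac{\mu}{\sigma^2}\sum_i x_i \;-\; \frac{1}{2\sigma^2}\sum_i x_i^2 \;-\; \frac{N\mu^2}{2\sigma^2}\right)$$

The likelihood, and therefore any posterior over the unknown parameters (μ, σ), depends on the data only through the two numbers ∑ᵢ xᵢ and ∑ᵢ xᵢ²; those two features are a sufficient statistic for this model class.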

3 · Adam Shimi · 1mo: Thanks for the spot-on pushback! I do understand what a sufficient statistics is -- which probably means I'm even more guilty of what you're accusing me of. And I agree completely that I don't defend correctly that the statistics I provide are really sufficient. If I try to explain myself, what I want to say in this post is probably something like
  • Knowing these intuitive properties about π and the goals seems sufficient to express and address basically any question we have related to goals and goal-directedness. (in a very vague intuitive way that I can't really justify).
  • To think about that in a grounded way, here are formulas for each property that look like they capture these properties.
  • Now what's left to do is to attack the aforementioned questions about goals and goal-directedness with these statistics, and see if they're enough. (Which is the topic of the next few posts)
Honestly, I don't think there's an argument to show these are literally sufficient statistics. Yet I still think staking the claim that they are is quite productive for further research. It gives concreteness to an exploration of goal-directedness, carving more grounded questions:
  • Given a question about goals and goal-directedness, are these properties enough to frame and study this question? If yes, then study it. If not, then study what's missing.
  • Are my formula adequate formalization of the intuitive properties?
This post mostly focuses on the second aspect, and to be honest, not even in as much detail as one could go. Maybe that means this post shouldn't exist, and I should have waited to see if I could literally formalize every question about goals and goal-directedness. But posting it to gather feedback on whether these statistics makes sense to people, and if they feel like something's missing, seemed valuable. That being said, my mistake (and what caused your knee-jerk reaction) was to just say these are literally sufficient statistics inst
The case for aligning narrowly superhuman models

I think one argument running through a lot of the sequences is that the parts of "human values" which mostly determine whether AI is great or a disaster are not the sort of things humans usually think of as "moral questions". Like, these examples from your comment below:

Was it bad to pull the plug on Terry Schiavo? How much of your income should you give to charity? Is it okay to kiss your cousin twice removed? Is it a good future if all the humans are destructively copied to computers? Should we run human challenge trials for covid-19 vaccines?

If an AGI i... (read more)

1 · Charlie Steiner · 1mo: I'd say "If an AGI is hung up on these sorts of questions [i.e. the examples I gave of statements human 'moral experts' are going to disagree about], then we've already mostly-won" is an accurate correlation, but doesn't stand up to optimization pressure. We can't mostly-win just by fine-tuning a language model to do moral discourse. I'd guess you agree? Anyhow, my point was more: You said "you get what you can measure" is a problem because the fact of the matter for whether decisions are good or bad is hard to evaluate (therefore sandwiching is an interesting problem to practice on). I said "you get what you measure" is a problem because humans can disagree when their values are 'measured' without either of them being mistaken or defective (therefore sandwiching is a procrustean bed / wrong problem).
Open Problems with Myopia

(On reflection this comment is less kind than I'd like it to be, but I'm leaving it as-is because I think it is useful to record my knee-jerk reaction. It's still a good post; I apologize in advance for not being very nice.)

In theory, such an agent is safe because a human would only approve safe actions.

... wat.

Lol no.

Look, I understand that outer alignment is orthogonal to the problem this post is about, but like... say that. Don't just say that a very-obviously-unsafe thing is safe. (Unless this is in fact nonobvious, in which case I will retract this comment and give a proper explanation.)

3 · Charlie Steiner · 1mo: You beat me to making this comment :P Except apparently I came here to make this comment about the changed version. "A human would only approve safe actions" is just a problem clause altogether. I understand how this seems reasonable for sub-human optimizers, but if you (now addressing Mark and Evan) think it has any particular safety properties for superhuman optimization pressure, the particulars of that might be interesting to nail down a bit better.
7 · Mark Xu · 1mo: Yeah, you're right that it's obviously unsafe. The words "in theory" were meant to gesture at that, but it could be much better worded. Changed to "A prototypical example is a time-limited myopic approval-maximizing agent. In theory, such an agent has some desirable safety properties because a human would only approve safe actions (although we still would consider it unsafe)."
The case for aligning narrowly superhuman models

"As capable as an expert" makes more sense. Part of what's confusing about "equivalent to a human thinking for a long time" is that it's picking out one very particular way of achieving high capability, but really it's trying to point to a more-general notion of "HCH can solve lots of problems well". Makes it sound like there's some structural equivalence to a human thinking for a long time, which there isn't.

5 · Rohin Shah · 1mo: Yes, I explicitly agree with this, which is why the first thing in my previous response was
The case for aligning narrowly superhuman models

I see, so it's basically assuming that problems factor.

3 · Ajeya Cotra · 1mo: Yeah, in the context of a larger alignment scheme, it's assuming that in particular the problem of answering the question "How good is the AI's proposed action?" will factor down into sub-questions of manageable size.
The case for aligning narrowly superhuman models

Where did this idea of HCH yielding the same benefits as a human thinking for a long time come from??? Both you and Ajeya apparently have this idea, so presumably it was in the water at some point? Yet I don't see any reason at all to expect it to do anything remotely similar to that.

4 · Adam Shimi · 1mo: Well, Paul's original post [https://ai-alignment.com/humans-consulting-hch-f893f6051455] presents HCH as the specification of a human enlightened judgement. And if we follow the links to Paul's previous post about this concept [https://ai-alignment.com/implementing-our-considered-judgment-6c715a239b3e], he does describe his ideal implementation of considered judgement (what will become HCH) using the intuition of thinking for decent amount of time. So it looks to me like "HCH captures the judgment of the human after thinking from a long time" is definitely a claim made in the post defining the concept. Whether it actually holds is another (quite interesting) question that I don't know the answer. A line of thought about this that I explore in Epistemology of HCH [https://www.alignmentforum.org/posts/CDSXoC54CjbXQNLGr/epistemology-of-hch#HCH_as_philosophical_abstraction] is the comparison between HCH and CEV [https://arbital.com/p/cev/]: the former is more operationally concrete (what I call an intermediary alignment scheme), but the latter can directly state the properties it has (like giving the same decision that the human after thinking for a long time), whereas we need to argue for them in HCH.
5 · Rohin Shah · 1mo: I agree with the other responses from Ajeya / Paul / Raemon, but to add some more info: ... I don't really know. My guess is that I picked it up from reading giant comment threads between Paul and other people. Tbc it doesn't need to be literally true. The argument needed for safety is something like "a large team of copies of non-expert agents could together be as capable as an expert". I see the argument "it's probably possible for a team of agents to mimic one agent thinking for a long time" as mostly an intuition pump for why that might be true.
5 · Ajeya Cotra · 1mo: The intuition for it is something like this: suppose I'm trying to make a difficult decision, like where to buy a house. There are hundreds of cities I'd be open to, each one has dozens of neighborhoods, and each neighborhood has dozens of important features, like safety, fun things to do, walkability, price per square foot, etc. If I had a long time, I would check out each neighborhood in each city in turn and examine how it does on each dimension, and pick the best neighborhood. If I instead had an army of clones of myself, I could send many of them to each possible neighborhood, with each clone examining one dimension in one neighborhood. The mes that were all checking out different aspects of neighborhood X can send up an aggregated judgment to a me that is in charge of "holistic judgment of neighborhood X", and the mes that focus on holistic judgments of neighborhoods can do a big pairwise bracket to filter up a decision to the top me.
4 · Raymond Arnold · 1mo: I had formed an impression that the hope was that the big chain of short thinkers would in fact do a good enough job factoring their goals that it would end up comparable to one human thinking for a long time (and that Ought was founded to test that hypothesis)
The case for aligning narrowly superhuman models

HCH is more like an infinite bureaucracy. You have some underlings who you can ask to think for a short time, and those underlings have underlings of their own who they can ask to think for a short time, and so on. Nobody in HCH thinks for a long time, though the total thinking time of one person and their recursive-underlings may be long.

(This is exactly why factored cognition is so important for HCH & co: the thinking all has to be broken into bite-size pieces, which can be spread across people.)
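A toy sketch of that recursive structure (here `short_think` and `decompose` are hypothetical stand-ins for a single bounded unit of human or model thought, not any particular implementation):

```python
# Toy sketch of the HCH recursion: no single call "thinks for a long time";
# each call does a small amount of work and delegates subquestions to fresh copies.

def short_think(question: str, subanswers: list[str]) -> str:
    """A single bounded unit of thought: combine subanswers into an answer."""
    return f"answer({question}; using {len(subanswers)} subanswers)"

def decompose(question: str) -> list[str]:
    """A single bounded unit of thought: split a question into bite-size pieces."""
    return [f"{question}::part{i}" for i in range(2)]

def hch(question: str, depth: int) -> str:
    if depth == 0:
        return short_think(question, [])                     # leaf: answer with no help
    subquestions = decompose(question)                       # bounded work by this copy
    subanswers = [hch(q, depth - 1) for q in subquestions]   # delegate to underlings
    return short_think(question, subanswers)                 # bounded work to aggregate

print(hch("draft a contract for this business", depth=3))
```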

1 · Ajeya Cotra · 1mo: Yes sorry — I'm aware that in the HCH procedure no one human thinks for a long time. I'm generally used to mentally abstracting HCH (or whatever scheme fits that slot) as something that could "effectively replicate the benefits you could get from having a human thinking a long time," in terms of the role that it plays in an overall scheme for alignment. This isn't guaranteed to work out, of course. My position is similar to Rohin's above:
The case for aligning narrowly superhuman models

How does iterated amplification achieve this? My understanding was that it simulates scaling up the number of people (a la HCH), not giving one person more time.

4 · Rohin Shah · 1mo: Yeah, sorry, that's right, I was speaking pretty loosely. You'd still have the same hope -- maybe a team of 2^100 copies of the business owner could draft a contract just as well, or better than, an already expert business-owner. I just personally find it easier to think about "benefits of a human thinking for a long time" and then "does HCH get the same benefits as humans thinking for a long time" and then "does iterated amplification get the same benefits as HCH".
3 · Ajeya Cotra · 1mo: My understanding is that HCH is a proposed quasi-algorithm for replicating the effects of a human thinking for a long time.
The case for aligning narrowly superhuman models

Ah... I think we have an enormous amount of evidence on very-similar problems.

For instance: consider a lawyer and a business owner putting together a contract. The business owner has a rough intuitive idea of what they want, but lacks expertise on contracts/law. The lawyer has lots of knowledge about contracts/law, but doesn't know what the business owner wants. The business owner is like our non-expert humans; the lawyer is like GPT.

In this analogy, the analogue of an expert human would be a business owner who is also an expert in contracts/law. The analo... (read more)

5 · Rohin Shah · 1mo: One approach is to let the human giving feedback think for a long time. Maybe the business owner by default can't write a good contract, but a business owner who could study the relevant law for a year would do just as well as the already expert business-owner. In the real world this is too expensive to do, but there's hope in the AI case (e.g. that's a hope behind iterated amplification).
The case for aligning narrowly superhuman models

But I don't think I agree that the most impressive-looking results will involve doing nothing to go beyond human feedback: successfully pulling off the sandwich method would most likely look significantly more impressive to mainstream ML researchers than just doing human feedback.

I partially agree with this; alignment is a bottleneck to value for GPT, and actually aligning it would likely produce some very impressive stuff. My disagreement is that it's a lot easier to make something which looks impressive than something which solves a Hard problem (like th... (read more)

4 · Ajeya Cotra · 1mo: I guess the crux here is "And if the Hard problem is indeed hard enough to not be solved by anyone," — I don't think that's the default/expected outcome. There hasn't been that much effort on this problem in the scheme of things, and I think we don't know where it ranges from "pretty easy" to "very hard" right now.
The case for aligning narrowly superhuman models

First and foremost, great post! "How do we get GPT to give the best health advice it can give?" is exactly the sort of thing I think about as a prototypical (outer) alignment problem. I also like the general focus on empirical directions and research-feedback mechanisms, as well as the fact that the approach could produce real economic value.

Now on to the more interesting part: how does this general strategy fail horribly?

If we set aside inner alignment and focus exclusively on outer alignment issues, then in-general the failure mode which I think is far a... (read more)

2 · Charlie Steiner · 1mo: Hm, interesting, I'm actually worried about a totally different implication of "you get what you can measure." E.g.: "If MTurkers are on average anti-abortion and your experts are on average pro-choice, what the hell will your MTurkers think about training an algorithm that tries to learn from anti-abortion folks and output pro-choice responses? Suppose you then run that same algorithm on the experts and it gives outputs in favor of legalizing infanticide - are the humans allowed to say "hold on, I don't want that," or are we just going to accept that as what peak performance looks like? So anyhow I'm pessimistic about sandwiching for moral questions." I'm curious if the upvote disparity means I'm the minority position here :P

Thanks for the comment! Just want to explicitly pull out and endorse this part:

the experts be completely and totally absent from the training process, and in particular no data from the experts should be involved in the training process

I should have emphasized that more in the original post as a major goal. I think you might be right that it will be hard to solve the "sandwich" problem without conceptual progress, but I also think that attempts to solve the sandwich problem could directly spur that progress (not just reveal the need for it, but also ta... (read more)

The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

Yeah, I wouldn't even include reward tampering. Outer alignment, as I think about it, is mostly the pointer problem, and the (values) pointer problem is a subset of outer alignment. (Though e.g. Evan would define it differently.)

The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

Ok, a few things here...

The post did emphasize, in many places, that there may not be any real-world thing corresponding to a human concept, and therefore constructing a pointer is presumably impossible. But the "thing may not exist" problem is only one potential blocker to constructing a pointer. Just because there exists some real-world structure corresponding to a human concept, or an AI has some structure corresponding to a human concept, does not mean we have a pointer. It just means that it should be possible, in principle, to create a pointer.

So, th... (read more)

1 · Richard Ngo · 1mo: Thanks for the reply. To check that I understand your position, would you agree that solving outer alignment plus solving reward tampering would solve the pointers problem in the context of machine learning? Broadly speaking, I think our disagreement here is closely related to one we've discussed before, about how much sense it makes to talk about outer alignment in isolation (and also about your definition of inner alignment), so I probably won't pursue this further.
The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

In other words, how do we find the corresponding variables? I've given you an argument that the variables in an AGI's world-model which correspond to the ones in your world-model can be found by expressing your concept in english sentences.

The problem is with what you mean by "find". If by "find" you mean "there exist some variables in the AI's world model which correspond directly to the things you mean by some English sentence", then yes, you've argued that. But it's not enough for there to exist some variables in the AI's world-model which correspond to... (read more)

1 · Richard Ngo · 1mo: Above you say:

And you also discuss how:

My two concerns are as follows. Firstly, that the problems mentioned in these quotes above are quite different from the problem of constructing a feedback signal which points to a concept which we know an AI already possesses. Suppose that you meet an alien and you have a long conversation about the human concept of happiness, until you reach a shared understanding of the concept. In other words, you both agree on what "the referents of these pointers" are, and what "the real-world things (if any) to which they're pointing" are? But let's say that the alien still doesn't care at all about human happiness. Would you say that we have a "pointer problem" with respect to this alien? If so, it's a very different type of pointer problem than the one you have with respect to a child who believes in ghosts. I guess you could say that there are two different but related parts of the pointer problem? But in that case it seems valuable to distinguish more clearly between them.

My second concern is that requiring pointers to be sufficient to "to get the AI to do what we mean" means that they might differ wildly depending on the motivation system of that specific AI and the details of "what we mean". For example, imagine if alien A is already be willing to obey any commands you give, as long as it understands them; alien B can be induced to do so via operant conditioning; alien C would only acquire human values via neurosurgery; alien D would only do so after millennia of artificial selection. So in the context of alien A, a precise english phrase is a sufficient pointer; for alien B, a few labeled examples qualifies as a pointer; for alien C, identifying a specific cluster of neurons (and how it's related to surrounding neurons) serves as a pointer; for alien D, only a millennium of supervision is a sufficient pointer. And then these all might change when we're talking about pointing to a different concept. And so adding the requirem
The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

The AI knowing what I mean isn't sufficient here. I need the AI to do what I mean, which means I need to program it/train it to do what I mean. The program or feedback signal needs to be pointed at what I mean, not just whatever English-language input I give.

For instance, if an AI is trained to maximize how often I push a particular button, and I say "I'll push the button if you design a fusion power generator for me", it may know exactly what I mean and what I intend. But it will still be perfectly happy to give me a design with some unintended side effects which I'm unlikely to notice until after pushing the button.

3 · Richard Ngo · 2mo: I agree with all the things you said. But you defined the pointer problem as: "what functions of what variables (if any) in the environment and/or another world-model correspond to the latent variables in the agent's world-model?" In other words, how do we find the corresponding variables? I've given you an argument that the variables in an AGI's world-model which correspond to the ones in your world-model can be found by expressing your concept in english sentences. The problem of determining how to construct a feedback signal which refers to those variables, once we've found them, seems like a different problem. Perhaps I'd call it the "motivation problem": given a function of variables in an agent's world-model, how do you make that agent care about that function? This is a different problem in part because, when addressing it, we don't need to worry about stuff like ghosts. Using this terminology, it seems like the alignment problem reduces to the pointer problem plus the motivation problem.
[AN #139]: How the simplicity of reality explains the success of neural nets

I believe the paper says that log densities are (approximately) polynomial - e.g. a Gaussian would satisfy this, since the log density of a Gaussian is quadratic.
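For concreteness, the Gaussian case (a one-line check, not taken from the paper itself):

$$\log p(x) \;=\; -\frac{(x-\mu)^2}{2\sigma^2} \;-\; \log\!\left(\sigma\sqrt{2\pi}\right)$$

which is a degree-2 polynomial in x, so it satisfies the (approximately) polynomial log-density condition.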

Utility Maximization = Description Length Minimization

I'll answer the second question, and hopefully the first will be answered in the process.

First, note that the probabilities are proportional to e^U, so arbitrarily large negative utilities aren't a problem - they get exponentiated, and yield probabilities arbitrarily close to 0. The problem is arbitrarily large positive utilities. In fact, they don't even need to be arbitrarily large, they just need to have an infinite exponential sum; e.g. if U is n for any whole number of paperclips n, then to normalize the probability distribution we need to divid... (read more)
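A minimal worked version of the failure mode, assuming the utility is just the paperclip count u(n) = n:

$$Z \;=\; \sum_{n=0}^{\infty} e^{u(n)} \;=\; \sum_{n=0}^{\infty} e^{n} \;=\; \infty$$

so there is no finite normalizing constant, and P[n] ∝ e^{u(n)} cannot be turned into a probability distribution.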

Utility Maximization = Description Length Minimization

Awesome question! I spent about a day chewing on this exact problem.

First, if our variables are drawn from finite sets, then the problem goes away (as long as we don't have actually-infinite utilities). If we can construct everything as limits from finite sets (as is almost always the case), then that limit should involve a sequence of world models.

The more interesting question is what that limit converges to. In general, we may end up with an improper distribution (conceptually, we have to carry around two infinities which cancel each other out). That's fine - improper distributions happen sometimes in Bayesian probability, we usually know how to handle them.

1 · Daniel Kokotajlo · 2mo: Thanks for the reply, but I might need you to explain/dumb-down a bit more.
--I get how if the variables which describe the world can only take a finite combination of values, then the problem goes away. But this isn't good enough because e.g. "number of paperclips" seems like something that can be arbitrarily big. Even if we suppose they can't get infinitely big (though why suppose that?) we face problems, see below.
--What does it mean in this context to construct everything as limits from finite sets? Specifically, consider someone who is a classical hedonistic utilitarian. It seems that their utility is unbounded above and below, i.e. for any setting of the variables, there is a setting which is a zillion times better and a setting which is a zillion times worse. So how can we interpret them as minimizing the bits needed to describe the variable-settings according to some model M2? For any M2 there will be at least one minimum-bit variable-setting, which contradicts what we said earlier about every variable-setting having something which is worse and something which is better.
Formal Solution to the Inner Alignment Problem

I think that the vast majority of the existential risk comes from that “broader issue” that you're pointing to of not being able to get worst-case guarantees due to using deep learning or evolutionary search or whatever. That leads me to want to define inner alignment to be about that problem...

[Emphasis added.] I think this is a common and serious mistake-pattern, and in particular is one of the more common underlying causes of framing errors. The pattern is roughly:

  • Notice cluster of problems X which have a similar underlying causal pattern Cause(X)
  • Notice
... (read more)

I mean, I don't think I'm “redefining” inner alignment, given that I don't think I've ever really changed my definition and I was the one that originally came up with the term (inner alignment was due to me, mesa-optimization was due to Chris van Merwijk). I also certainly agree that there are “more than just inner alignment problems going on in the lack of worst-case guarantees for deep learning/evolutionary search/etc.”—I think that's exactly the point that I'm making, which is that while there are other issues, inner alignment is what I'm most concerned about. That being said, I also think I was just misunderstanding the setup in the paper—see Rohin's comment on this chain.

Suggestions of posts on the AF to review

Related to the role of peer review: a lot of stuff on LW/AF is relatively exploratory, feeling out concepts, trying to figure out the right frames, etc. We need to be generally willing to discuss incomplete ideas, stuff that hasn't yet had the details ironed out. For that to succeed, we need community discussion standards which tolerate a high level of imperfect details or incomplete ideas. I think we do pretty well with this today.

But sometimes, you want to be like "come at me bro". You've got something that you're pretty highly confident is right, and y... (read more)

3 · Adam Shimi · 2mo: Yeah, when I think about implementing a review process for the Alignment Forum, I'm definitely thinking about something you can ask for more polished research, in order to get external feedback and a tag saying this is peer review (for prestige and reference). Thanks for the suggestions! We'll consider them. :)
Fixing The Good Regulator Theorem

Good enough. I don't love it, but I also don't see easy ways to improve it without making it longer and more technical (which would mean it's not strictly an improvement). Maybe at some point I'll take the time to make a shorter and less math-dense writeup.

Fixing The Good Regulator Theorem

I was considering this, but the problem is that in your setup S is supposed to be derived from X (that is, S is a deterministic function of X), which is not true when X = training data and S = that which we want to predict.

That's an (implicit) assumption in Conant & Ashby's setup, I explicitly remove that constraint in the "Minimum Entropy -> Maximum Expected Utility and Imperfect Knowledge" section. (That's the "imperfect knowledge" part.)

If S is derived from X, then "information in S" = "information in X relevant to S"

Same here. Once we relax the ... (read more)
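One way to state the distinction in information-theoretic terms (my gloss of the relaxation, not a quote from the post):

$$I(X;S) \;\le\; H(S), \qquad \text{with equality exactly when } S \text{ is a deterministic function of } X.$$

Conant & Ashby's setup assumes the equality case; once S is only probabilistically related to X, "information in X relevant to S" can be strictly less than "information in S", which is why the two readings come apart.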

3 · Rohin Shah · 2mo: ... That'll teach me to skim through the math in posts I'm trying to summarize. I've edited the summary, lmk if it looks good now.
Fixing The Good Regulator Theorem

Yes! That is exactly the sort of theorem I'd expect to hold. (Though you might need to be in POMDP-land, not just MDP-land, for it to be interesting.)

Fixing The Good Regulator Theorem

Four things I'd change:

  • In the case of a neural net, I would probably say that the training data is X, and S is the thing we want to predict. Z measures (expected) accuracy of prediction, so to make good predictions with minimal info kept around from the data, we need a model. (Other applications of the theorem could of course say other things, but this seems like the one we probably want most.)
  • On point (3), M contains exactly the information from X relevant to S, not the information that S contains (since it doesn't have access to all the information S con
... (read more)
2 · Rohin Shah · 2mo: I was considering this, but the problem is that in your setup S is supposed to be derived from X (that is, S is a deterministic function of X), which is not true when X = training data and S = that which we want to predict. If S is derived from X, then "information in S" = "information in X relevant to S"

Fair point. I kind of wanted to abstract away this detail in the operationalization of "relevant", but it does seem misleading as stated. Changed to "important for optimal performance".

I was hoping that this would come through via the neural net example, where Z obviously includes new information in the form of the new test inputs which have to be labeled. I've added the sentence "Note that it is usually important that Z contains some new input (e.g. test images to be classified) to prevent M from hardcoding solutions to Z without needing to look at S" to the second point to clarify. (In general I struggled with keeping the summary short vs. staying true to the details of the causal model.)
Fixing The Good Regulator Theorem

Yeah, "get a grant" is definitely not the part of that plan which is a hard sell. Hiring people is a PITA. If I ever get to a point where I have enough things like this, which could relatively-easily be offloaded to another person, I'll probably do it. But at this point, no.

Fixing The Good Regulator Theorem

Oh absolutely, the original is still awful and their proof does not work with the construction I just gave.

BTW, this got a huge grin out of me:

Status: strong opinions, weakly held. not a control theorist; not only ready to eat my words, but I've already set the table. 

As I understand it, the original good regulator theorem seems even dumber than you point out.

Fixing The Good Regulator Theorem

The reason I think entropy minimization is basically an ok choice here is that there's not much restriction on which variable's entropy is minimized. There's enough freedom that we can transform an expected-utility-maximization problem into an entropy-minimization problem.

In particular, suppose we have a utility variable U, and we want to maximize E[U]. As long as possible values of U are bounded above, we can subtract a constant without changing anything, making U strictly negative. Then, we define a new random variable Z, which is generated from U in suc... (read more)
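The construction is cut off above, but the flavor of the transformation (along the lines of Utility Maximization = Description Length Minimization; the actual Z constructed here may differ in details) is: with u(x) ≤ 0, define a model M2 by

$$\log P_2(x) \;=\; u(x) \;-\; \log Z, \qquad Z = \sum_x e^{u(x)},$$

so that for any distribution over outcomes induced by a choice of policy,

$$E[u(X)] \;=\; E[\log P_2(X)] + \log Z \;=\; -\,E[-\log P_2(X)] + \text{const},$$

i.e. maximizing expected utility is the same as minimizing the expected number of bits needed to describe the outcome under M2.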

3 · Alex Turner · 2mo: Okay, I agree that if you remove their determinism & full observability assumption (as you did in the post), it seems like your construction should work. I still think that the original paper seems awful (because it's their responsibility to justify choices like this in order to explain how their result captures the intuitive meaning of a 'good regulator').
Fixing The Good Regulator Theorem

Note on notation...

You can think of something like  as a python dictionary mapping x-values to the corresponding  values. That whole dictionary would be a function of Y. In the case of something like , it's a partial policy mapping each second-input-value y and regulator output value r to the probability that the regulator chooses that output value on that input value, and we're thinking of that whole partial policy as a function of the first input value X. So, it's a function which is itself a rand... (read more)
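A toy illustration of the "dictionary" reading (the variable names here are mine, purely for illustration, not the post's notation):

```python
# A "random variable whose values are functions", represented as a dictionary.
# For each value x of the first input X, we get a whole table (a partial policy)
# mapping (y, r) pairs to probabilities.
partial_policy_given_x = {
    "x1": {("y1", "r1"): 0.9, ("y1", "r2"): 0.1,
           ("y2", "r1"): 0.5, ("y2", "r2"): 0.5},
    "x2": {("y1", "r1"): 0.2, ("y1", "r2"): 0.8,
           ("y2", "r1"): 0.5, ("y2", "r2"): 0.5},
}
# Sampling X and then looking up partial_policy_given_x[X] yields a function-valued
# random variable: the "value" is itself a table from (y, r) to probabilities.
```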

Fixing The Good Regulator Theorem

Your bullet points are basically correct. In practice, applying the theorem to any particular NN would require some careful setup to make the causal structure match - i.e. we have to designate the right things as "system", "regulator", "map", "inputs X & Y", and "outcome", and that will vary from architecture to architecture. But I expect it can be applied to most architectures used in practice.

I'm probably not going to turn this into a paper myself soon. At the moment, I'm pursuing threads which I think are much more promising - in particular, thinkin... (read more)

1 · Daniel Kokotajlo · 2mo: Doesn't sound like a job for me, but would you consider e.g. getting a grant to hire someone to coauthor this with you? I think the "getting a grant" part would not be the hard part.
Fixing The Good Regulator Theorem

That's the right question to ask. Conant & Ashby intentionally leave both the type signature and the causal structure of the regulator undefined - they have a whole spiel about how it can apply to multiple different setups (though they fail to mention that in some of those setups - e.g. feedback control - the content of the theorem is trivial).

For purposes of my version of the theorem, the types of the variables themselves don't particularly matter, as long as the causal structure applies. The proofs implicitly assumed that the variables have finitely many values, but of course we can get around that by taking limits, as long as we're consistent about our notion of "minimal information".

Evolution of Modularity

The material here is one seed of a worldview which I've updated toward a lot more over the past year. Some other posts which involve the theme include Science in a High Dimensional World, What is Abstraction?, Alignment by Default, and the companion post to this one Book Review: Design Principles of Biological Circuits.

Two ideas unify all of these:

  1. Our universe has a simplifying structure: it abstracts well, implying a particular kind of modularity.
  2. Goal-oriented systems in our universe tend to evolve a modular structure which reflects the structure of the u
... (read more)
But exactly how complex and fragile?

This is max_{w ∈ W} u(w), right? And then you might just constrain the subset of W which the agent can search over?

Exactly.

One toy model to conceptualize what a "compact criterion" might look like: imagine we take a second-order expansion of u around some u-maximal world-state w*. Then, the eigendecomposition of the Hessian of u around w* tells us which directions-of-change in the world state u cares about a little or a lot. If the constraints lock the accessible world-states into the directions which u doesn't care about much (i.e. eigenvalu... (read more)
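Spelling out that toy model (my rendering of the expansion described above):

$$u(w) \;\approx\; u(w^*) \;+\; \tfrac{1}{2}\,(w-w^*)^\top H\,(w-w^*), \qquad H \;=\; \nabla^2 u(w^*) \;=\; Q\Lambda Q^\top$$

The gradient term vanishes because w* is a maximum, and H is negative semidefinite there. Directions (columns of Q) with large-magnitude eigenvalues are ones u cares about a lot; directions with near-zero eigenvalues are ones it barely cares about. The "compact criterion" would then be something like: the accessible perturbations lie mostly in the span of the near-zero-eigenvalue directions.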

But exactly how complex and fragile?

So one example would be, fix an EU maximizer. To compute value sensitivity, we consider the sensitivity of outcome value with respect to a range of feasible perturbations to the agent's utility function. The perturbations only affect the utility function, and so everything else is considered to be part of the dynamics of the situation. You might swap out the EU maximizer for a quantilizer, or change the broader society in which the agent is deployed, but these wouldn't classify as 'perturbations' in the original ontology.

Let me know if this is what you're ... (read more)

3 · Alex Turner · 3mo: Yes, this is basically what I had in mind! I really like this grounding; thanks for writing it out. If there were a value fragility research agenda, this might be a good start; I haven't yet decided whether I think there are good theorems to be found here, though. Can you expand on

This is max_{w ∈ W} u(w), right? And then you might just constrain the subset of W which the agent can search over? Or did you have something else in mind?
But exactly how complex and fragile?

In other words: "against which compact ways of generating perturbations is human value fragile?". But don't you still need to consider some dynamics for this question to be well-defined?

Not quite. If we frame the question as "which compact ways of generating perturbations", then that's implicitly talking about dynamics, since we're asking how the perturbations were generated. But if we know what perturbations are generated, then we can say whether human value is fragile against those perturbations, regardless of how they're generated. So, rather than ... (read more)

3 · Alex Turner · 3mo: (I meant to say 'perturbations', not 'permutations') Hm, maybe we have two different conceptions. I've been imagining singling out a variable (e.g. the utility function) and perturbing it in different ways, and then filing everything else under the 'dynamics.' So one example would be, fix an EU maximizer. To compute value sensitivity, we consider the sensitivity of outcome value with respect to a range of feasible perturbations to the agent's utility function. The perturbations only affect the utility function, and so everything else is considered to be part of the dynamics of the situation. You might swap out the EU maximizer for a quantilizer, or change the broader society in which the agent is deployed, but these wouldn't classify as 'perturbations' in the original ontology. Point is, these perturbations aren't actually generated within the imagined scenarios, but we generate them outside of the scenarios in order to estimate outcome sensitivity. Perhaps this isn't clean, and perhaps I should rewrite parts of the review with a clearer decomposition.
But exactly how complex and fragile?

I read through the first part of this review, and generally thought "yep, this is basically right, except it should factor out the distance metric explicitly rather than dragging in all this stuff about dynamics". I had completely forgotten that I said the same thing a year ago, so I was pretty amused when I reached the quote.

Anyway, I'll defend the distance metric thing a bit here.

But what exactly happens between "we write down something too distant from the 'truth'" and the result? The AI happens. But this part, the dynamics, it's kept invisible.

I claim ... (read more)

3 · Alex Turner · 3mo: In other words: "against which compact ways of generating perturbations is human value fragile?". But don't you still need to consider some dynamics for this question to be well-defined? So it doesn't seem like it captures all of the regularities implied by:

But I do presently agree that it's a good conceptual handle for exploring robustness against different sets of perturbations.
The Pointers Problem: Clarifications/Variations

Great post!

I especially like "try to maximize values according to models which, according to human beliefs, track the things we care about well". I ended up at a similar point when thinking about the problem. It seems like we ultimately have to use this approach, at some level, in order for all the type signatures to line up. (Though this doesn't rule out entirely different approaches at other levels, as long as we expect those approaches to track the things we care about well.)

On amplified values, I think there's a significant piece absent from the discus... (read more)

Selection vs Control

The initial state of the program/physical computer may not overlap with the target space at all. The target space wouldn't be larger or smaller (in the sense of subsets); it would just be an entirely different set of states.

Flint's notion of optimization, as I understand it, requires that we can view the target space as a subset of the initial space.
