I initially thought you were going to debate Beth Barnes.
Also, thanks for the episode :) It was definitely interesting, although I still don't have a good handle on why some people are optimistic that there aren't classes of arguments humans will "fall for" irrespective of their truth value.
One generalization I am also interested in is to learn not merely abstract objects within a big model, but entire self-contained abstract levels of description, together with actions and state transitions that move you between abstract states. E.g. not merely detecting that "the grocery store" is a sealed box ripe for abstraction, but that "go to the grocery store" is a valid action within a simplified world-model with nice properties.
This might be significantly more challenging to say something interesting about, because it depends not just on the world but on how the agent interacts with the world.
Very nice overview! Of course, I think most of the trick is crammed into that last bit :) How do you get a program to find the "common-sense" implied model of the world to use for counterfactuals?
Even when talking about how humans shouldn't always be thought of as having some "true goal" that we just need to communicate, it's so difficult to avoid talking in that way :) We naturally phrase alignment as alignment to something - and if it's not humans, well, it must be "alignment with something bigger than humans." We don't have the words to be more specific than "good" or "good for humans," without jumping straight back to aligning outcomes to something specific like "the goals endorsed by humans under reflective equilibrium" or whatever.
We need a good linguistic-science fiction story about a language with no such issues.
I am frankly skeptical that this (section 3.9 in the pretrained frozen transformer paper) will hold up to Grad Student Descent on training parameters. But hey, maybe I'm wrong and there's some nice property of the pretrained weights that can only be pushed into overfitting by finetuning.
Sure, but if you're training on less data it's because fewer parameters is worse :P
I'm not sure how your reply relates to my guess, so I'm a little worried.
If you're intending the compute comment to be in opposition to my first paragraph, then no - when finetuning a subset of the parameters, compute is not simply proportional to the size of the subset you're finetuning, because you still have to do all the matrix multiplications of the original model, both for the forward pass and for gradient propagation. I think the point of the paper finetuning only a subset was to make a scientific point, not to save compute.
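Here's a minimal sketch of that point, assuming a PyTorch-style setup (the toy model and layer choice are purely illustrative): even when only the last layer's parameters are trainable, the forward and backward passes still run through every frozen layer.

```python
import torch
from torch import nn

# Toy stand-in for a pretrained model (illustrative only).
model = nn.Sequential(
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 10),
)

# Freeze everything, then unfreeze only the last layer.
for p in model.parameters():
    p.requires_grad = False
for p in model[-1].parameters():
    p.requires_grad = True

optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)

x, y = torch.randn(32, 512), torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()   # still propagates through all the frozen matmuls
optimizer.step()  # but only the unfrozen subset gets updated
```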
My edit question was just because you ... (read more)
I think it's plausible that the data dependence will act like it's 3 OOM smaller. Compute dependence will be different, though, right? Even if you're just finetuning part of the model you have to run the whole thing to do evaluation. In a sense this actually seems like the worst of both worlds (but you get the benefit from pretraining).
Edit: Actually, I'm confused why you say a smaller model needs that factor fewer steps. I thought the slope on that one was actually quite gentle. It's just that smaller models are cheap - or am I getting it wrong?
But is that true? Human behavior has a lot of information. We normally say that this extra information is irrelevant to the human's beliefs and preferences (i.e. the agential model of humans is a simplification), but it's still there.
Sure, but the interesting thing to me isn't fixed points in the input/output map, it's properties (i.e. attractors that are allowed to be large sets) that propagate from the answers seen by a human in response to their queries, into their output.
Even if there's a fixed point, you have to further prove that this fixed point is consistent - that it's actually the answer to some askable question. I feel like this is sort of analogous to Hofstadter's Q-sequence.
Yeah, I agree with this. It's certainly possible to see normal human passage through time as a process with probable attractors. I think the biggest differences are that HCH is a psychological "monoculture," HCH has tiny bottlenecks through which to pass messages compared to the information I can pass to my future self, and there's some presumption that the output will be "an answer" whereas I have no such demands on the brain-state I pass to tomorrow.
If we imagine actual human imitations I think all of these problems have fairly obvious solutions, but I t... (read more)
I'd say "If an AGI is hung up on these sorts of questions [i.e. the examples I gave of statements human 'moral experts' are going to disagree about], then we've already mostly-won" is an accurate correlation, but doesn't stand up to optimization pressure. We can't mostly-win just by fine-tuning a language model to do moral discourse. I'd guess you agree?
Anyhow, my point was more: You said "you get what you can measure" is a problem because the fact of the matter for whether decisions are good or bad is hard to evaluate (therefore sandwiching is an interest... (read more)
Hm, interesting, I'm actually worried about a totally different implication of "you get what you can measure."
"If MTurkers are on average anti-abortion and your experts are on average pro-choice, what the hell will your MTurkers think about training an algorithm that tries to learn from anti-abortion folks and output pro-choice responses? Suppose you then run that same algorithm on the experts and it gives outputs in favor of legalizing infanticide - are the humans allowed to say "hold on, I don't want that," or are we just going to accept that as wha... (read more)
I think one argument running through a lot of the sequences is that the parts of "human values" which mostly determine whether AI is great or a disaster are not the sort of things humans usually think of as "moral questions". Like, these examples from your comment below:
Was it bad to pull the plug on Terry Schiavo? How much of your income should you give to charity? Is it okay to kiss your cousin twice removed? Is it a good future if all the humans are destructively copied to computers? Should we run human challenge trials for covid-19 vaccines?
If an AGI i... (read more)
(Edited for having an actual point)
You mention some general ways to get non-myopic behavior, but when it comes to myopic behavior you default to a clean, human-comprehensible agent model. I'm curious if you have any thoughts on open avenues related to training procedures that encourage myopia in inner optimizers, even if those inner optimizers are black boxes? I do seem to vaguely recall a post from one of you about this, or maybe it was Richard Ngo.
You beat me to making this comment :P Except apparently I came here to make this comment about the changed version.
"A human would only approve safe actions" is just a problem clause altogether. I understand how this seems reasonable for sub-human optimizers, but if you (now addressing Mark and Evan) think it has any particular safety properties for superhuman optimization pressure, the particulars of that might be interesting to nail down a bit better.
Re: part 1 -
Good points, I agree. Though I think you could broadly replicate the summarization result using supervised learning - the hope for using supervised learning in superhuman domains is that your model learns a dimension of variation for "goodness" that can generalize well even if you condition on "goodness" being slightly outside any of the training examples.
Re: part 2 -
What it boils down to is that my standards (and I think the practical standards) for medical advice are low, while my standards for moral advice are high (as in, you could use this... (read more)
I (conceptual person) broadly do agree that this is valuable.
It's possible that we won't need this work - that alignment research can develop AI that doesn't benefit from the same sort of work you'd do to get GPT-3 to do tricks on command. But it's also possible that this really would be practice for "the same sort of thing we want to eventually do."
My biggest concern is actually that the problem is going to be too easy for supervised learning. Need GPT-3 to dispense expert medical advice? Fine-tune it on a corpus of expert medical advice! Or for slightly ... (read more)
I'm still holding out hope for jumping straight to FAI :P Honestly I'd probably feel safer switching on a "big human" than a general CIRL agent that models humans as Boltzmann-rational.
Though on the other hand, does modern ML research already count as trying to use UFAI to learn how to build FAI?
I normally don't think of most functions as polynomials at all - in fact, I think of most real-world functions as going to zero for large values. E.g. the function "dogness" vs. "nose size" cannot be any polynomial, because polynomials (or their inverses) blow up unrealistically for large (or small) nose sizes.
I guess the hope is that you always learn even-degree polynomials (so both extremes blow up the same way), oriented in such a way that the extremes appear unappealing?
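A quick numerical illustration of the blow-up worry (purely illustrative; the bump-shaped "dogness" curve and the degree-6 fit are my own stand-ins, not anything from the discussion above):

```python
import numpy as np

nose = np.linspace(0.5, 3.0, 50)        # hypothetical "nose size" range seen in training
dogness = np.exp(-(nose - 1.5) ** 2)    # bump-shaped target that goes to zero at extremes

coeffs = np.polyfit(nose, dogness, deg=6)    # best-fit degree-6 polynomial on that range
print(np.polyval(coeffs, 10.0))              # the polynomial blows up far outside the data
print(np.exp(-(10.0 - 1.5) ** 2))            # the bump itself is essentially zero there
```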
I recently got reminded of this post. I'm not sure I agree with it, because I think we have different paradigms for AI alignment - I'm not nearly so concerned with the sort of oversight that relies on looking at the state of the computer. Though I have nothing against the sort of oversight where you write a program to tell you about what's going on with your model.
Instead, I think that anticipating the effects of QC on AI alignment is a task in prognosticating how ML is going to change if you make quantum computing available. I think the relevant killer ap... (read more)
Steve's big thoughts on alignment in the brain probably deserve a review. Component posts include https://www.lesswrong.com/posts/diruo47z32eprenTg/my-computational-framework-for-the-brain , https://www.lesswrong.com/posts/DWFx2Cmsvd4uCKkZ4/inner-alignment-in-the-brain , https://www.lesswrong.com/posts/jNrDzyc8PJ9HXtGFm/supervised-learning-of-outputs-in-the-brain
Interestingly, I think there aren't any of my posts I should recommend - basically all of them are speculative. However, I did have a post called Gricean communication and meta-preferences th... (read more)
I think this looks fine for IDA - the two problems remain the practical one of implementing Bayesian reasoning in a complicated world, and the philosophical one that probably IDA on human imitations doesn't work because humans have bad safety properties.
Hm, I thought that was what Evan called it, but maybe I misheard. Anyhow, I mean the problem where because you can model humans in different ways, we have no unique utility function. We might think of this as having not just one Best Intentional Stance, but a generalizable intentional stance with knobs and dials on it, different settings of which might lead to viewing the subject in different ways.
I call such real-world systems that can be viewed non-uniquely through the lens of the intentional stance "approximate agents."
To the extent that mesa-optimizers... (read more)
Great interview! Weird question - did Rob Miles get a sneak peek at this interview, given that he just did a video on the same paper?
The biggest remaining question I have is a followup on the question you asked "Am I a mesa-optimizer, and if so, what's my meta-objective?" You spend some time talking about lookup tables, but I wanted to hear about human-esque "agents" that seem like they do planning, but simultaneously have a very serious determination problem for their values - is Evan's idea to try to import some "solution to outer alignment" to these age... (read more)
Oh wait, are you the first author on this paper? I didn't make the connection until I got around to reading your recent post.
So when you talk about moving to a hierarchical human model, how practical do you think it is to also move to a higher-dimensional space of possible human-models, rather than using a few hand-crafted goals? This necessitates some loss function or prior probability over models, and I'm not sure how many orders of magnitude more computationally expensive it makes everything.
My first idea is, you take your common sense AI, and rather than saying "build me a spaceship, but, like, use common sense," you can tell it "do the right thing, but, like, use common sense." (Obviously with "saying" and "tell" in invisible finger quotes.) Bam, Type-1 FAI.
Of course, whether this will go wrong or not depends on the specifics. I'm reminded of Adam Shimi et al's recent post that mentioned "Ideal Accomplishment" (how close to an explicit goal a system eventually gets) and "Efficiency" (how fast it gets there). If you have a general purpose "co... (read more)
I'm a lot less excited about the literature of the world's philosophy than I am about the living students of it.
Of course, there are some choices in designing an AI that are ethical choices, for which there's no standard by which one culture's choice is better than another's. In this case, incorporating diverse perspectives is "merely" a fair way to choose how to steer the future - a thing to do because we want to, not because it solves some technical problem.
But there are also philosophical problems faced in the construction of AI that are technical probl... (read more)
Children learn to follow common sense, despite not having (explicit) meta-ethical and meta-normative beliefs at all.
Children also learn right from wrong - I'd be interested in where you draw the line between "An AI that learns common sense" and "An AI that learns right from wrong." (You say this argument doesn't apply in the case of human values, but it seems like you mean only explicit human values, not implicit ones.)
My suspicion, which is interesting to me so I'll explain it even if you're going to tell me that I'm off base, is that you're thinking that... (read more)
The little quizzes were highly effective in getting me to actually read the post :)
I think depending on what position you take, there are differences in how much one thinks there's "room for a lot of work in this sphere." The more you treat goal-directedness as important because it's a useful category in our map for predicting certain systems, the less important it is to be precise about it. On the other hand, if you want to treat goal-directedness in a human-independent way or otherwise care about it "for its own sake" for some reason, then it's a different story.
Good question. There's a big roadblock to your idea as stated, which is that asking something to define "alignment" is a moral question. But suppose we sorted out a verbal specification of an aligned AI and had a candidate FAI coded up - could we then use Debate on the question "does this candidate match the verbal specification?"
I don't know - I think it still depends on how bad humans are as judges of arguments - we've made the domain more objective, but maybe there's some policy of argumentation that still wins by what we would consider cheating. I can ... (read more)
I think the Go example really gets to the heart of why I think Debate doesn't cut it.
The reason Go is hard is that it has a large game tree despite simple rules. When we treat an AI's play as information about the value of a state of the Go board, we know exactly what the rules are and how the game should be scored; the superhuman work the AIs are doing is in searching a game tree that's too big for us. The adversarial gameplay provides a check that the search through the game tree is actually finding high-scoring policies.
What does this framework need to... (read more)
Re: non-agenty AGI. The typical problem is that there are incentives for individual actors to build AI systems that pursue goals in the world. So even if you postulate non-agenty AGI, you then have to further figure out why nobody has asked the Oracle AI "What's the code to an AI that will make me rich?" or asked it for the motor output of a robot given various sense data and natural-language goals, then used that output to control a robot (also see https://slatestarcodex.com/2020/01/06/a-very-unlikely-chess-game/ ).
I'm reminded a bit of the reason why Sudoku and quantum computing are difficult: the possibilities you have to track are not purely local, they can be nonlocal combinations of different things. General NNs seem like they'd be at least NP-hard to interpret.
But this is what dropout is useful for, penalizing reliance on correlations. So maybe if you're having trouble interpreting something you can just crank up the dropout parameters. On the other hand, dropout also promotes redundancy, which might make interpretation confusing - perhaps there's something ... (read more)
Sorry for being slow :) No, I haven't read anything of Bratman's. Should I? The synopsis looks like it might have some interesting ideas but I'm worried he could get bogged down in what human planning "really is" rather than what models are useful.
I'd totally be happy to chat either here or in PMs. Full Bayesian reasoning seems tricky if the environment is complicated enough to make hierarchical planning attractive - or do you mean optimizing a model for posterior probability (the prior being something like MML?) by local search?
I think one interesting que... (read more)
What does it mean to have an agent in the information-state?
Nevermind, I think I was just looking at it with the wrong class of reward function in mind.
Ah, okay, I see what you mean. Like how preferences are divisible into "selfish" and "worldly" components, where the selfish component is what's impacted by a future simulation of you that is about to have good things happen to it.
(edit: The reward function in AMDPs can either be analogous to "worldly" and just sum the reward calculated at individual timesteps, or analogous to "selfish" and calculated by taking the limit of the subjective distribution over parts of the history, then applying a reward function to the expected histories.)
I brought up the hist... (read more)
Great example. At least for the purposes of explaining what I mean :) The memory AMDP would just replace the states s0, s1 with the memory states [s0], [s1], [s0,s0], [s0,s1], etc. The action takes a robot in [s0] to memory state [s0,s1], and a robot in [s0,s1] to one robot in [s0,s1,s0] and another in [s0,s1,s1].
(Skip this paragraph unless the specifics of what's going on aren't obvious: given a transition distribution P(s′∗|s,π) (P being the distribution over sets of sta... (read more)
Could you expand a little on why you say that no Bayesian hypothesis captures the distribution over robot-histories at different times? It seems like you can unroll an AMDP into a "memory MDP" that puts memory information of the robot into the state, thus allowing Bayesian calculation of the distribution over states in the memory MDP to capture history information in the AMDP.
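Here's a minimal sketch of the unrolling I have in mind, under a toy two-state base dynamics I made up for illustration (the transition table and function names are assumptions, not from the post): each memory state is the tuple of base states visited so far, matching the [s0], [s0,s1], [s0,s1,s0] example above.

```python
# Toy base dynamics (my own assumption): from s0 the action leads to s1,
# from s1 it can lead to either s0 or s1.
BASE_TRANSITION = {"s0": ["s1"], "s1": ["s0", "s1"]}

def step(memory_state):
    """All memory states reachable in one step, appending the new base state."""
    current = memory_state[-1]
    return [memory_state + (nxt,) for nxt in BASE_TRANSITION[current]]

frontier = [("s0",)]
for _ in range(2):
    frontier = [m for state in frontier for m in step(state)]
print(frontier)   # [('s0', 's1', 's0'), ('s0', 's1', 's1')]
```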
Yeah, I agree, it seems both more human-like and more powerful to have a dynamical system where models are activating other models based on something like the "lock and key" matching of neural attention. But for alignment purposes, it seems to me that we need to not only optimize models for usefulness or similarity to actual human thought, but also for how similar they are to how humans think of human thought - when we imagine an AI with the goal of doing good, we want it to have decision-making that matches our understanding of "doing good." The model in ... (read more)
Just ended up reading your paper (well, a decent chunk of it), so thanks for the pointer :)
Congrats! Here are my totally un-researched first thoughts:
Pre-1950 History: Speculation about artificial intelligence (if not necessarily in the modern sense) dates back extremely far: the brazen head, Frankenstein, the Mechanical Turk, the robots of R.U.R. Basically all of this treats the artificial intelligence as essentially a human - though the brazen head mythology is maybe more related to deals with djinni or devils - and all of it (together with more modern science fiction) can be lumped into a pile labeled "exposes and shapes human intuition, but no... (read more)
Yes, the point is multiple abstraction levels (or at least multiple abstractions, ordered into levels or not). But not multiple abstractions used by humans, multiple abstractions used on humans.
If you don't agree with me on this, why didn't you reply when I spent about six months just writing posts that were all variations of this idea? Here's Scott Alexander making the basic point.
It's like... is there a True rational approximation of pi? Well, 22/7 is pretty good, but 355/113 is more precise, if harder to remember. And just 3 is really easy to remember, ... (read more)
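A quick sanity check on those numbers (just a few lines of Python; `limit_denominator` bounds the "memory cost" of the fraction): better approximations buy precision at the price of a bigger denominator.

```python
from fractions import Fraction
import math

for max_den in (1, 10, 200):
    approx = Fraction(math.pi).limit_denominator(max_den)
    print(approx, abs(float(approx) - math.pi))
# 3        error ~0.14
# 22/7     error ~1.3e-3
# 355/113  error ~2.7e-7
```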
I think that one of the problems in this post is actually easier in the real world than in the toy model.
In the toy model the AI has to succeed by maximizing the agent's True Values, which the agent is assumed to have as a unique function over its model of the world. This is a very tricky problem, especially when, as you point out, we might allow the agent's model of reality to be wrong in places.
But in the real world, humans don't have a unique set of True Values or even a unique model of the world - we're non-Cartesian, which means that when we talk abou... (read more)
I'm pretty on board with this research agenda, but I'm curious what you think about the distinction between approaches that look like finding a fixed point, and approaches that look like doing perturbation theory.
And on the assumption that you have no idea what I'm referring to, here's the link to my post.
There are a couple different directions to go from here. One way is to try to collapse the recursion. Find a single agent-shaped model of humans that is (or approximates) a fixed point of this model-ratification process (and also hopefully stays close to
Yeah, this is basically CIRL, when the human-model is smart enough to do Gricean communication. The important open problems left over after starting with CIRL are basically "how do you make sure that your model of communicating humans infers the right things about human preferences?", both due to very obvious problems like human irrationality, and also due to weirder stuff like the human intuition that we can't put complete confidence in any single model.
Very interesting. Naturalizing feedback (as opposed to directly accessing True Reward) seems like it could lead to a lot of desirable emergent behaviors, though I'm somewhat nervous about reliance on a handwritten model of what reliable feedback is.
Interesting post. Not sure I agree with your interpretation of the "real objective" - it might be better served by looking for stable equilibria and just calling them that.
Don't we already have weak alignment to arbitrary functions using annealing (basically, jump at random, but jump around more/further on average when the loss is higher, and lower the jumping rate over time)? The reason we don't add small annealing terms to gradient descent is entirely because we expect them to be worse in the short term (a "strong alignment" question).
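For concreteness, here's a minimal sketch of a standard annealing loop (Metropolis-style acceptance with a decaying temperature; the loss, schedule, and step sizes are placeholders I chose, not anything specific from the comment above):

```python
import math
import random

def anneal(loss, x0, steps=10_000, t0=1.0):
    x, best = x0, x0
    for i in range(steps):
        t = t0 * (1 - i / steps) + 1e-6           # temperature decays over time
        candidate = x + random.gauss(0, t)        # bigger jumps while it's hot
        delta = loss(candidate) - loss(x)
        if delta < 0 or random.random() < math.exp(-delta / t):
            x = candidate                         # sometimes accept uphill moves
        if loss(x) < loss(best):
            best = x
    return best

print(anneal(lambda x: (x - 3) ** 2 + math.sin(5 * x), x0=0.0))
```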
Typo in the definition of product: $b \cdot e$ should be $b \star e$.
I also agree that direct jumps in capability due to research insight are rare. But in part I think that's just because things get tried at small scale first, and so there's always going to be some scaling-up period where the new insight gets fed more and more resources, eventually outpacing the old state of the art. From a coarse-grained perspective GPT-2 relative to your favorite LSTM model from 2018 is the "jump in capability" due to research insight, it just got there in a not-so-discontinuous way.
Maybe you're optimistic that in the future, everyone wil... (read more)
I think my biggest disagreement with the Rohin-character is about continuity. I expect there will be plenty of future events like BERT vs. RNNs. BERT isn't all that much better than RNNs on small datasets, but it scales better - so then OpenAI comes along and dumps 100x or 10,000x the compute into something like it just to see what happens.
Not only do these switches make me less confident that capabilities won't have sudden jumps, but I think they pose a problem for carrying safety properties over from the past to the future. Right now, DeepMind solves some ali... (read more)
so then OpenAI comes along and dumps 100x or 10,000x the compute into something like it just to see what happens.
10000x would be unprecedented -- why wouldn't you first do a 100x run to make sure things work well before doing a 10000x run? (This only increases costs by 1%.)
Also, 10000x increase in compute corresponds to 100-1000x more parameters, which does not usually lead to things I would call "discontinuities" (e.g. GPT-2 to GPT-3 does not seem like an important discontinuity to me, even if we ignore the in-between models trained along the way). Put an... (read more)