One issue is that good research tools are hard to build, and organizations may be reluctant to share them (especially since making good research tools public-facing is even more effort). Like, can I go out and buy a subscription to Anthropic's interpretability tools right now? That seems to be the future Toby (whose name, might I add, is highly confusable with Justin Shovelain's) is pushing for.
Sure - another way of phrasing what I'm saying is that I'm not super interested (as alignment research, at least) in adversarial training that involves looking at difficult subsets of the training distribution, or adversarial training where the proposed solution is to give the AI more labeled examples that effectively extend the training distribution to include the difficult cases.
It would be bad if we built an AI that wasn't robust on the training distribution, of course, but I think of this as a problem already being addressed by the field of ML without any need for looking ahead to AGI.
I think there are different kinds of robustness, and people focused on present-day applications (including tests that are easy to do in the present) are going to focus on the kinds of robustness that help with present-day problems. Being robust to malicious input from human teenagers will only marginally help make you robust to input from a future with lots of extra technological progress. They might have very different-looking solutions, because of factors like interpolation vs. extrapolation.
Framing it this way suggests one concrete thing I might hope fo... (read more)
Here's my worry.
If we adopt a little bit of deltonian pessimism (though not the whole hog), and model present-day language models as doing something vaguely like nearest-neighbor interpolation in a slightly sophisticated latent space (while still being very impressive), then we might predict that there are going to be some ways of getting honest answers an impressive percentage of the time while staying entirely within the interpolation regime.
And then if you look at the extrapolation regime, it's basically the entire alignment problem squeezed into a smal... (read more)
It seems like there must be some decent ways to see how different two classifiers are, but I can only think of unprincipled things.
Sample a lot of items and use both models to generate two rankings of the items (or log odds or some other score). Models that give similar scores to lots of examples are probably pretty similar. One problem with this is that optimizing for it when the problem is too easy will train your model to solve the problem a random way and then invert the ordering within the classes. (A similar solution with a similar problem ... (read more)
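A minimal sketch of the ranking comparison described above, assuming each model exposes one real-valued score (e.g. log odds) per sampled item. The function names and the toy scores are hypothetical, and Spearman rank correlation is just one choice of similarity measure. The second model makes the same decisions as the first but inverts the ordering within each class, which is the failure mode mentioned above.

```python
def ranks(scores):
    """Return the rank (0 = lowest) of each score; assumes no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def spearman(scores_a, scores_b):
    """Spearman rank correlation between two score lists (no ties)."""
    n = len(scores_a)
    ra, rb = ranks(scores_a), ranks(scores_b)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Toy log odds for six items: three negatives, then three positives.
model_a = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
# Same decisions at threshold 0, but the order is inverted within each class.
model_b = [-1.0, -2.0, -3.0, 3.0, 2.0, 1.0]

same_labels = [(a > 0) == (b > 0) for a, b in zip(model_a, model_b)]
rho = spearman(model_a, model_b)

print(all(same_labels))  # True: the two models agree on every label
print(round(rho, 3))     # 0.543: yet the rankings look quite different
```

So a score based on rank agreement can report a large difference between two models that classify every item identically.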
When you say "some case in which a human might make different judgments, but where it's catastrophic for the AI not to make the correct judgment," what I hear is "some case where humans would sometimes make catastrophic judgments."
I think such cases exist and are a problem for the premise that some humans agreeing means an idea meets some standard of quality. Bumbling into such cases naturally might not be a dealbreaker, but there are some reasons we might get optimization pressure pushing plans proposed by an AI towards the limits of human judgment.
I wrote some thoughts that look like they won't get posted anywhere else, so I'm just going to paste them here with minimal editing:
It might be useful to think of this as an empirical claim about diamonds.
I think this statement encapsulates some worries I have.
If it's important how the human defines a property like "the same diamond," then assuming that the sameness of the diamond is "out there in the diamond" will get you into trouble - e.g. if there's any optimization pressure to find cases where the specifics of the human's model rear their head. Human judgment is laden with the details of how humans model the world, you can't avoid dependence on the human (and the messiness that en... (read more)
This isn't about "inner alignment" (as catchy as the name might be), it's just about regular old alignment.
But I think you're right. As long as the learning step "erases" the model editing in a sensible way, then I was wrong and there won't be an incentive for the learned model to compensate for the editing process.
So if you can identify a "customer gender detector" node in some complicated learned world-model, you can artificially set it to a middle value as a legitimate way to make the RL agent less sensitive to customer gender.
I'm not sure how well this... (read more)
The facepalm was just because if this is really all inside the same RL architecture (no extra lines leading from the world-model to an unsupervised predictive reward), then all that will happen is the learned model will compensate for the distortions.
I like this post, and I'm really happy to be kept apprised of what you're up to. But you can probably guess why I facepalmed when I saw the diagram with the green boxes :P
I'm not sure who's saying that AI alignment is "part of modern ML research." So I don't know if it's productive to argue for or against that. But there are definitely lots of people saying that AI alignment is part of the field of AI, and it sounds like you're disagreeing with that as well - is that right? How much would you say that this categorization is a bravery debate about what people need to hear / focus on?
The meteor doesn't have to really flatten things out, there might be some actions that we think remain valuable (e.g. hedonism, saying tearful goodbyes).
And so if we have Knightian uncertainty about the meteor, maximin (as in Vanessa's link) means we'll spend a lot of time on tearful goodbyes.
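A toy sketch of that point, with made-up utilities (the action names, scenario names, and numbers are all assumptions for illustration). Maximin compares each action by its worst case over the scenarios we have Knightian uncertainty about, while an agent with an actual probability for the meteor weighs the scenarios.

```python
# Hypothetical utilities: outer keys are actions, inner keys are scenarios.
UTILITY = {
    "play_bandit":      {"no_meteor": 10.0, "meteor": 0.0},
    "tearful_goodbyes": {"no_meteor": 1.0,  "meteor": 1.0},
}


def maximin_action(utility):
    """Pick the action whose worst-case utility is highest."""
    return max(utility, key=lambda action: min(utility[action].values()))


def expected_utility_action(utility, p_meteor):
    """Pick the action with highest expected utility given P(meteor)."""
    def expected(action):
        u = utility[action]
        return p_meteor * u["meteor"] + (1 - p_meteor) * u["no_meteor"]
    return max(utility, key=expected)


print(maximin_action(UTILITY))                 # tearful_goodbyes
print(expected_utility_action(UTILITY, 0.01))  # play_bandit
```

Because goodbyes keep some value even if the meteor hits, they win under the worst case, no matter how unlikely the meteor would be judged under an ordinary probability.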
I am very confused. How is this better than just telling the human overseers "no, really, be conservative about implementing things that might go wrong"? What makes a two-part architecture seem appealing? What does "epistemic dominance" look like in concrete terms here? What are the observables you want to dominate HCH relative to? Wouldn't this be very expensive? How does this translate to buying you extra safety?
What if you assumed the stuff you had the hypothesis about was independent of the stuff you have Knightian uncertainty about (until proven otherwise)?
E.g. if you're making hypotheses about a multi-armed bandit and the world also contains a meteor that might smash through your ceiling and kill you at any time, you might want to just say "okay, ignore the meteor, pretend my utility has a term for gambling wins that doesn't depend on the meteor at all."
The reason I want to consider stuff more like this is because I don't like having to evaluate my utility fun... (read more)
Of the agent foundations work from 2020, I think this sequence is my favorite, and I say this without actually understanding it.
The core idea is that Bayesianism is too hard. And so what we ultimately want is to replace probability distributions over all possible things with simple rules that don't have to put a probability on all possible things. In some ways this is the complement to logical uncertainty - logical uncertainty is about not having to have all possible probability distributions possible, this is about not having to put probability distributi... (read more)
I'm confused about the Nirvana trick then. (Maybe here's not the best place, but oh well...) Shouldn't it break the instant you do anything with your Knightian uncertainty other than taking the worst-case?
This was a really interesting post, and is part of a genre of similar posts about acausal interaction with consequentialists in simulatable universes.
The short argument is that if we (or not us, but someone like us with way more available compute) try to use the Kolmogorov complexity of some data to make a decision, our decision might get "hijacked" by simple programs that run for a very very long time and simulate aliens who look for universes where someone is trying to use the Solomonoff prior to make a decision and then based on what decision they want,... (read more)
if we assume our initial implementation to be flawed in a way that destroys alignment, why wouldn’t it also be flawed in a way that destroys corrigibility?
I think the people most interested in corrigibility are imagining a situation where we know what we're doing with corrigibility (e.g. we have some grab-bag of simple properties we want satisfied), but don't even know what we want from alignment, and then they imagine building an unaligned slightly-sub-human AGI and poking at it while we "figure out alignment."
Maybe this is a strawman, because the thin... (read more)
I still feel like there's just too many pigeons and not enough holes.
Like, if you're an agent in some universe with complexity K(U) and you're located by a bridging rule with complexity K(B), you are not an agent with complexity K(U). Average case you have complexity (or really you think the world has some complexity) K(U)+K(B) minus some small constant. We can illustrate this fact by making U simple and B complicated - like locating a particular string within the digits of pi.
And if an adversary in a simple universe (complexity K(U')) "hijacks" you by ins... (read more)
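The pi example above can be made concrete with a rough sketch. Here the "universe" is a short program that prints the digits of pi, and the "bridging rule" is the index at which a target string first appears. The function names, the use of Machin's formula, and the digit budget are all assumptions made for illustration, and index length is only a crude stand-in for K(B).

```python
import math


def arctan_inverse(x, unity):
    """Integer approximation of arctan(1/x) * unity via the Taylor series."""
    term = unity // x
    total = term
    x_squared = x * x
    divisor = 3
    sign = -1
    while term:
        term //= x_squared
        total += sign * (term // divisor)
        sign = -sign
        divisor += 2
    return total


def pi_digit_string(n):
    """Return '3' followed by the first n decimal digits of pi."""
    guard = 10  # extra digits to absorb rounding error in the series
    unity = 10 ** (n + guard)
    # Machin's formula: pi = 16*arctan(1/5) - 4*arctan(1/239)
    pi_scaled = 16 * arctan_inverse(5, unity) - 4 * arctan_inverse(239, unity)
    return str(pi_scaled // 10 ** guard)


def bridging_cost(target, n_digits=5000):
    """Return (index, bits to write the index, bits to write the target)."""
    index = pi_digit_string(n_digits).find(target)
    if index < 0:
        return None  # not found within the digit budget
    index_bits = math.log2(index + 1)
    target_bits = len(target) * math.log2(10)
    return index, index_bits, target_bits


print(bridging_cost("14159"))  # found right after the leading 3, so cheap
print(bridging_cost("271"))    # a typical target sits much deeper
```

The generator stays the same few lines no matter which string you look for. For a typical target, though, the index is on average about as long to write down as the target itself, so the locating step carries most of the description length.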
In terms of how this strategy breaks, I think there's a lot of human guidance required to avoid either trying variations on the same not-quite-right ideas over and over, or trying a hundred different definitely-not-right ideas.
Given comfort and inertia, I expect the average research group to need impetus towards mixing things up. And they're smart people, so I'm looking forward to seeing what they do next.
I think this is a totally fine length. But then, I would :P
I still feel like this was a little gears-light. Do the proposed examples of gradient hacking really work if you make a toy neural network with them? (Or does gradient descent find a way around the apparent local minimum?)
Just leaving a sanity check, even though I'm not sure about what the people who were more involved at the time are thinking about the 5 and 10 problem these days:
Yes, I agree this works here. But it's basically CDT - this counterfactual is basically a causal do() operator. So it might not work for other problems that proof-based UDT was intended to solve in the first place, like the absent-minded driver, the non-anthropic problem, or simulated boxing.
If the information won't fit into human ways of understanding the world, then we can't name what it is we're missing. This always makes examples of "inaccessible information" feel weird to me - like we've cheated by even naming the thing we want as if it's somewhere in the computer, and instead our first step should be to design a system that at all represents the thing we want.
I think you could actually predict that nukes wouldn't destroy the planet in 1800 (or at least 1810), and that it would be large organizations rather than mad scientists who built them.
The reasoning for not destroying the earth is similar to the argument that the LHC won't destroy the earth. The LHC is probably fine because high energy cosmic rays hit us all the time and we're fine. Is this future bomb dangerous because it creates a chain reaction? Meteors hit us and volcanos erupt without creating chain reactions. Is this bomb super-dangerous because it c... (read more)
Part of what makes me skeptical of the logic "we have seen humans who we trust, so the same design space probably has decent density of superhumans who we'd trust" is that I'm not sold on the (effective) orthogonality thesis for human brains. Our cognitive limitations seem like they're an active ingredient in our conceptual/moral development. We might easily know how to get human-level brain-like AI to be trustworthy but never know how to get the same design with 10x the resources to be trustworthy.
I think one of the things that average AI researchers are thinking about brains is that humans might not be very safe for other humans (link to Wei Dai post). I at least would pretty strongly disagree with "The brain is a totally aligned general intelligence."
I really like the thought about empathy as an important active ingredient in learning from other people's experience. It's very cool. It sort of implies an evo-psych story about the incentives for empathy that I'm not sure is true - what's the prevalence of empathy in social but non-general ... (read more)
What does wanting to use adversarial training say about the sorts of labeled and unlabeled data we have available to us? Are there cases where we could either do adversarial training, or alternately train on a different (potentially much more expensive) dataset?
What does it say about our loss function on different sorts of errors? Are there cases where we could either do adversarial training, or just use a training algorithm that uses a different loss function?
Yeah, the nonlinearity means it's hard to know what question to ask.
If we just eyeball the graph and say that the Elo is log(log(compute)) + time (I'm totally ignoring constants here), and we assume that compute = $e^t$ so that conveniently log(compute) = t, then $\frac{d}{dt}\mathrm{Elo} = \frac{1}{t} + 1$. The first term is from compute and the second from software. And so our history is totally not scale-free! There's some natural timescale set by t=1, before which chess progress was dominated by compute and after which chess progress will be (was?) dominated by sof... (read more)
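Spelling out the arithmetic above, under the comment's own assumptions (natural logarithms, all constants ignored, compute growing as $e^t$):

```latex
\begin{align*}
\mathrm{Elo}(t) &= \log\bigl(\log(\mathrm{compute})\bigr) + t,
  \qquad \mathrm{compute} = e^{t} \\
                &= \log t + t \\
\frac{d}{dt}\,\mathrm{Elo}(t)
                &= \underbrace{\frac{1}{t}}_{\text{compute}}
                 + \underbrace{1}_{\text{software}}
\end{align*}
```

The two terms are equal at t = 1. For t < 1 the compute term 1/t is the larger one, and for t > 1 the constant software term is.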
Good points, thanks for the elaboration. I agree it could also be the case that integrating thoughts with different locations of origin only happens by broadcasting both separately and then only later synthesizing them with some third mechanism (is this something we can probe by having someone multitask in an fMRI and looking for rapid strobe-light alternations of [e.g.] "count to 10"-related and "do the hand jive"-related activations?).
In a modus ponens / modus tollens sort of way, such a non-synthesizing GNW would be less useful to understanding consciou... (read more)
Just re-read this because you cited it recently, and I like it even more the second time :)
I also like an intermediate point between the changes you lay out: keeping the "old style" tree diagram that puts outer alignment and objective robustness together under "intent alignment," but changing the interpretation of these boxes to your "new style" version where outer alignment is less impressive / stringent and robustness is more central.
Hm, I'm not so sure about this take on GNW (I'm not saying you're inaccurate about the literature, I just feel disagreeable).
To illustrate my reservations: soon after I read the sentence about GNW meaning you can only be conscious of one thing at a time, as I was considering that proposition, I felt my chin was a little itchy and so I scratched it. So now I can remember thinking about the proposition while simultaneously scratching my chin. Trying to recall exactly what I was thinking at the time now also brings up a feeling of a specific body posture.
This... (read more)
Saying that resource acquisition is in the service of improved planning (because it makes future plans better) seems like a bit of a stretch - you could just as easily say that improved planning is in the service of resource acquisition (because it lets you use resources you couldn't before). "But executing plans is how you get the goal!" you might say, and "But using your resources is how you get to the goal!" is the reply.
Maybe this is nitpicking, because I agree with you that there is some central thing going on here that is the same whatever you choose to call "more fundamental." Some essence of getting to the goal, even though the world is bigger than me. So I'm looking forward to where this is headed.
Which examples are you thinking of? Modern Stockfish outperformed historical chess engines even when using the same resources, until far enough in the past that computers didn't have enough RAM to load it.
I definitely agree with your original-comment points about the general informativeness of hardware, and absolutely software is adapting to fit our current hardware. But this can all be true even if advances in software can make more than 20 orders of magnitude difference in what hardware is needed for AGI, and are much less predictable than advances in hardware rather than being adaptations in lockstep with it.
On the one hand this is an interesting and useful piece of data on AI scaling and the progress of algorithms. It's also important because it makes the point that the very notion of "progress of algorithms" implies hardware overhang as important as >10 years of Moore's law. I also enjoyed the follow-up work that this spawned in 2021.
This was super interesting.
I don't think you can directly compare brain voltage to Landauer limit, because brains operate chemically, so we also care about differences in chemical potential (e.g. of sodium vs potassium, which are importantly segregated across cell membranes even though both have the same charge). To really illustrate this, we might imagine information-processing biology that uses no electrical charges, only signalling via gradients of electrically-neutral chemicals. I think this raises the total potential relative to Landauer and cuts down the number of molecules we should estimate as transported per signal.
Wow, I'd forgotten about that prediction dataset! It seems like there's only even semi-decent data since 1994, but since then there does seem to be a plausible ~35-year median in the recorded points (even though, or perhaps because, the sampled distribution has been changing over time).
The chess link maybe should go to hippke's work. What you can see there is that a fixed chess algorithm takes an exponentially growing amount of compute and transforms it into logarithmically-growing Elo. Similar behavior features in recent pessimistic predictions of deep learning's future trajectory.
If general navigation of the real world suffers from this same logarithmic-or-worse penalty when translating hardware into performance metrics, then (perhaps surprisingly) we can't conclude that hardware is the dominant driver of progress by noticing that the ... (read more)
I have a question about this entirely divorced from practical considerations. Can we play silly ordinal games here?
If you assume that the other agent will take the infinite-order policy, but then naively maximize your expected value rather than unrolling the whole game-playing procedure, this is sort of like ω+1. So I guess my question is, if you take this kind of dumb agent (that still has to compute the infinite agent) as your baseline and then re-build an infinite tower of agents (playing other agents of the same level) on top of it, does it reconv... (read more)
So we have a switch with two positions, "R" and "L."
When the switch is "R," the agent is supposed to want to go to the right end of the hallway, and vice versa for "L" and left. It's not that you want this agent to be uncertain about the "correct" value of the switch and so it's learning more about the world as you send it signals - you just want the agent to want to go to the left when the switch is "L," and to the right when the switch is "R."
If you start with the agent going to the right along this hallway, and you change the switch to "L," and then a m... (read more)
I think instrumental convergence should still apply to some utility functions over policies, specifically the ones that seem to produce "smart" or "powerful" behavior from simple rules. But I don't know how to formalize this or if anyone else has.
Someone at the coffee hour (Viktoriya? Apologies if I've forgotten a name) gave a short explanation of this using cycles. If you imagine an agent moving either to the left or the right along a hallway, you can change its utility function in a cycle such that it repeatedly ends up in the same place in the hallway with the same utility function.
This basically eliminates expected utility (as a discounted sum of utilities of states) maximization as producing this behavior. But you can still imagine selecting a policy such that it takes the right actions in res... (read more)
One thing we can do to help is set up our AI to avoid taking us into weird out-of-distribution situations where my preferences are ill-defined. Another thing we can do to help is have meta-preferences about how to deal with situations where my preferences are ill-defined, and have the AI learn those meta-preferences.
And in fact, "We don't actually want to go to the sort of extreme value that you can coax a model of us into outputting in weird out-of-distribution situations" is itself a meta-preference, and so we might expect something that does a good job o... (read more)
This was a whole 2 weeks ago, so all I can say for sure is that I was at least ambiguous about your point.
But I feel like I kind of gave a reply anyway - I don't think the parallel with subagents is very deep. But there's a very strong parallel (or maybe not even a parallel, maybe this is just the thing I'm talking about) with generative modeling.
Parts of this remind me of flaming my team in a cooperative game.
A key rule to remember about team chat in videogames is that chat actions are moves in the game. It might feel satisfying to verbally dunk on my teammate for a̶s̶k̶i̶n̶g̶ ̶b̶i̶a̶s̶e̶d̶ ̶̶q̶u̶e̶s̶t̶i̶o̶n̶s̶ not ganking my lane, and I definitely do it sometimes, but I do it less if I occasionally think "what chat actions can help me win the game from this state?"
This is less than maximally helpful advice in a conversation where you're not sure what "winning" looks like. And some of the more obvious implications might look like the dreaded social obeisance.
Ngo is very patient and understanding.
Perhaps... too patient and understanding. Richard! Blink twice if you're being held against your will!
(I too would like you to write more about agency :P)
Ah, yeah. I guess this connection makes perfect sense if we're imagining supervising black-box-esque AIs that are passing around natural language plans.
Although that supervision problem is more like... summarizing Alice in Wonderland if all the pages had gotten ripped out and put back in random order. Or something. But sure, baby steps.
I'd heard about this before, but not the alignment spin on it. This is more interesting to me from a capabilities standpoint than an alignment standpoint, so I had assumed that this was motivated by the normal incentives for capabilities research. I'd be interested if I'm in fact wrong, or if it seems more alignment-y to other people.
From an alignment perspective the main point is that the required human data does not scale with the length of the book (or maybe scales logarithmically). In general we want evaluation procedures that scale gracefully, so that we can continue to apply them even for tasks where humans can't afford to produce or evaluate any training examples.
The approach in this paper will produce worse summaries than fine-tuning a model end-to-end. In order to produce good summaries, you will ultimately need to use more sophisticated decompositions---for example, if a char... (read more)
Yeah, agreed. It's true that GPT obeys the objective "minimize the cross-entropy loss between the output and the distribution of continuations in the training data." But this doesn't mean it doesn't also obey objectives like "write coherent text", to the extent that we can tell a useful story about how the training set induces that behavior.
(It is amusing to me how our thoughts immediately both jumped to our recent hobbyhorses.)
My suggestion is that for practically everything you say in this post, you can say a closely-analogous thing where you throw out the word "model" and just say "the human has lots of preferences, and those preferences don't always agree with each other, especially OOD".
Yes, I'm fine with this rephrasing. But I wouldn't write a post using only the "the human has the preferences" way of speaking, because lots of different ways of thinking about the world use that same language.
This is basically a "subagent" perspective.
I think this post is pretty different... (read more)
By "violate a preference," I mean that the preference doesn't get satisfied - so if the human competently prefers 2 bananas but only got 1 banana, their preference has been violated.
But maybe you mean something along the lines of "If competent preferences are really broadly predictive, then wouldn't it be even more predictive to infer the preference 'the human prefers 2 bananas except when the AI gives them 1', since that would more accurately predict how many bananas the human gets? This would sort of paint us into a corner where it's hard to violate com... (read more)