All of Charlie Steiner's Comments + Replies

Sharing Powerful AI Models

One issue is that good research tools are hard to build, and organizations may be reluctant to share them (especially since making good research tools public-facing is even more effort). Like, can I go out and buy a subscription to Anthropic's interpretability tools right now? That seems to be the future Toby (whose name, might I add, is highly confusable with Justin Shovelain's) is pushing for.

1 A Ray 4d: It does seem that public/shared investment into tools that make structured access programs easier might make more of them happen. As boring as it is, this might be a good candidate for technical standards for interoperability/etc.
Truthful LMs as a warm-up for aligned AGI

Sure - another way of phrasing what I'm saying is that I'm not super interested (as alignment research, at least) in adversarial training that involves looking at difficult subsets of the training distribution, or adversarial training where the proposed solution is to give the AI more labeled examples that effectively extend the training distribution to include the difficult cases.

It would be bad if we built an AI that wasn't robust on the training distribution, of course, but I think of this as a problem already being addressed by the field of ML without any need for looking ahead to AGI.

Truthful LMs as a warm-up for aligned AGI

I think there are different kinds of robustness, and people focused on present-day applications (including tests that are easy to do in the present) are going to focus on the kinds of robustness that help with present-day problems. Being robust to malicious input from human teenagers will only marginally help make you robust to input from a future with lots of extra technological progress. They might have very different-looking solutions, because of factors like interpolation vs. extrapolation.

Framing it this way suggests one concrete thing I might hope fo... (read more)

1 Jacob Hilton 10d: I think this is included in what I intended by "adversarial training": we'd try to find tasks that cause the model to produce negligent falsehoods, train the model to perform better at those tasks, and aim for a model that is robust to someone searching for such tasks.
Truthful LMs as a warm-up for aligned AGI

Here's my worry.

If we adopt a little bit of deltonian pessimism (though not the whole hog), and model present-day language models as doing something vaguely like nearest-neighbor interpolation in a slightly sophisticated latent space (while still being very impressive), then we might predict that there are going to be some ways of getting honest answers an impressive percentage of the time while staying entirely within the interpolation regime.

And then if you look at the extrapolation regime, it's basically the entire alignment problem squeezed into a smal... (read more)

3 Jacob Hilton 10d: I can think of a few different interpretations of your concern (and am interested to hear if these don't cover it):
  • There will be insufficient attention paid to robustness.
  • There will be insufficient attention paid to going beyond naive human supervision.
  • The results of the research will be misinterpreted as representing more progress than is warranted.
I agree that all of these are possibilities, and that the value of the endeavor could well depend on whether the people conducting (and communicating) the research are able to avoid pitfalls such as these. There's certainly more object-level discussion to be had about how much emphasis should be placed on avoiding these particular pitfalls, and I'm happy to dig in to them further if you're able to clarify which if any of them capture your main concern.
Different way classifiers can be diverse

It seems like there must be some decent ways to see how different two classifiers are, but I can only think of unprincipled things.

Two ideas:

Sample a lot of items and use both models to generate two rankings of the items (or log odds or some other score). Models that give similar scores to lots of examples are probably pretty similar. One problem with this is that optimizing for it when the problem is too easy will train your model to solve the problem a random way and then invert the ordering within the classes. (A similar solution with a similar problem ... (read more)
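
As a rough illustration of the first idea, here is a minimal sketch that scores a shared sample of items with two classifiers and compares the resulting rankings via Spearman rank correlation. The toy classifiers (`clf_x`, `clf_x_scaled`, `clf_y`) and the Gaussian sample are hypothetical stand-ins, not anything from the post, and the metric itself is only one of many possible choices.

```python
import random

def ranks(scores):
    """Rank of each score (0 = lowest); tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2.0
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def rank_similarity(model_a, model_b, items):
    """Spearman correlation between two models' scores on the same items."""
    return pearson(ranks([model_a(x) for x in items]),
                   ranks([model_b(x) for x in items]))

# Hypothetical toy classifiers on 2-D points; each returns a score.
def clf_x(p):
    return p[0]

def clf_x_scaled(p):
    return 3.0 * p[0] + 1.0   # different scores, identical ranking

def clf_y(p):
    return p[1]               # relies on an unrelated feature

rng = random.Random(0)
items = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(2000)]

sim_same = rank_similarity(clf_x, clf_x_scaled, items)          # rankings agree
sim_diff = rank_similarity(clf_x, clf_y, items)                 # roughly unrelated
sim_inverted = rank_similarity(clf_x, lambda p: -p[0], items)   # ranking reversed
```

The inverted case hints at the problem mentioned above: a model that simply reverses an ordering looks maximally "different" under this score without having learned anything new.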

ARC's first technical report: Eliciting Latent Knowledge

When you say "some case in which a human might make different judgments, but where it's catastrophic for the AI not to make the correct judgment," what I hear is "some case where humans would sometimes make catastrophic judgments."

I think such cases exist and are a problem for the premise that some humans agreeing means an idea meets some standard of quality. Bumbling into such cases naturally might not be a dealbreaker, but there are some reasons we might get optimization pressure pushing plans proposed by an AI towards the limits of human judgment.

ARC's first technical report: Eliciting Latent Knowledge

I wrote some thoughts that look like they won't get posted anywhere else, so I'm just going to paste them here with minimal editing:

  • They (ARC) seem to imagine that for all the cases that matter, there's some ground-truth-of-goodness judgment the human would make if they knew the facts (in a fairly objective way that can be measured by how well the human does at predicting things), and so our central challenge is to figure out how to tell the human the facts (or predict what the human would say if they knew all the facts).
  • In contrast, I don't think there's
... (read more)
2 Paul Christiano 25d: Generally we are asking for an AI that doesn't give an unambiguously bad answer, and if there's any way of revealing the facts where we think a human would (defensibly) agree with the AI, then probably the answer isn't unambiguously bad and we're fine if the AI gives it. There are lots of possible concerns with that perspective; probably the easiest way to engage with them is to consider some concrete case in which a human might make different judgments, but where it's catastrophic for our AI not to make the "correct" judgment. I'm not sure what kind of example you have in mind and I have somewhat different responses to different kinds of examples. For example, note that ELK is never trying to answer any questions of the form "how good is this outcome?"; I certainly agree that there can also be ambiguity about questions like "did the diamond stay in the room?" but it's a fairly different situation. The most relevant sections are "narrow elicitation and why it might be sufficient", which gives a lot of examples of where we think we can/can't tolerate ambiguity, and to a lesser extent "avoiding subtle manipulation", which explains how you might get a good outcome despite tolerating such ambiguity. That said, there are still lots of reasonable objections to both of those.
1 Ramana Kumar 25d: I think the problem you're getting at here is real -- path-dependency of what a human believes on how they came to believe it, keeping everything else fixed (e.g., what the beliefs refer to) -- but I also think ARC's ELK problem is not claiming this isn't a real problem but rather bracketing (deferring) it for as long as possible. Because there are cases where ELK fails that don't have much path-dependency in them, and we can focus on solving those cases until whatever else is causing the problem goes away (and only path-dependency is left).
ARC's first technical report: Eliciting Latent Knowledge

It might be useful to think of this as an empirical claim about diamonds.

I think this statement encapsulates some worries I have.

If it's important how the human defines a property like "the same diamond," then assuming that the sameness of the diamond is "out there in the diamond" will get you into trouble - e.g. if there's any optimization pressure to find cases where the specifics of the human's model rear their head. Human judgment is laden with the details of how humans model the world, you can't avoid dependence on the human (and the messiness that en... (read more)

Demanding and Designing Aligned Cognitive Architectures

This isn't about "inner alignment" (as catchy as the name might be), it's just about regular old alignment.

But I think you're right. As long as the learning step "erases" the model editing in a sensible way, then I was wrong and there won't be an incentive for the learned model to compensate for the editing process.

So if you can identify a "customer gender detector" node in some complicated learned world-model, you can artificially set it to a middle value as a legitimate way to make the RL agent less sensitive to customer gender.

I'm not sure how well this... (read more)

Demanding and Designing Aligned Cognitive Architectures

The facepalm was just because if this is really all inside the same RL architecture (no extra lines leading from the world-model to an unsupervised predictive reward), then all that will happen is the learned model will compensate for the distortions.

1 Koen Holtman 1mo: Not entirely sure what you mean with your aside on 'unsupervised predictive reward'. Is this a reference to unsupervised reinforcement learning? To a human supervisor controlling a live reward signal? But on your observation that 'the learned model will compensate for distortions': this sounds familiar. Here is a discussion.
INTUITION PUMPS AND INNER ALIGNMENT FAILURE
It is common for people on this forum to use a teleological intuition pump which makes them fear that such compensation for distortions must somehow always happen, or is very likely to happen, as an emergent phenomenon in any advanced RL or AGI system. Inner alignment failure has become a popular term on this forum when talking about this fear. But this teleological intuition pump that 'life reward maximization, uh, finds a way' is deeply incorrect in the general case. Especially for the case of agents which are designed not to be pure reward maximizers. I have more discussion about this in the paper, where I show how you can approach figure 4 with very different intuition pumps about using incorrect maps of territories. These intuition pumps will tell you much more clearly how and why this works. For the mathematically inclined, I also include references to hard mathematical work, which should of course take precedence over intuition.
STEEL-MANNING AN INNER ALIGNMENT FAILURE ARGUMENT
That being said, I'll now provide some failure mode analysis to show that 'the learned model will compensate for distortions' could conceivably happen, if no care is taken at all to prevent it. There are two world models in the cognitive architecture of figure 4, a blue one and a green one. The green one drives the agent's decisions. The goal of the architecture is to ensure that this green world model driving the reward-optimizing decisions is specifically incorrect.
To counteract that goal, via the failure mode of 'the learned model will compensate for distortions', we have to imagine the following. W
Demanding and Designing Aligned Cognitive Architectures

I like this post, and I'm really happy to be kept apprised of what you're up to. But you can probably guess why I facepalmed when I saw the diagram with the green boxes :P

I'm not sure who's saying that AI alignment is "part of modern ML research." So I don't know if it's productive to argue for or against that. But there are definitely lots of people saying that AI alignment is part of the field of AI, and it sounds like you're disagreeing with that as well - is that right? How much would you say that this categorization is a bravery debate about what people need to hear / focus on?

3 Koen Holtman 1mo: Thanks! I can think of several reasons why different people on this forum might facepalm when seeing the diagram with the green boxes. Not sure if I can correctly guess yours. Feel free to expand. Yes I am disagreeing, of sorts. I would disagree with the statement that "AI alignment research is a subset of AI research" but I agree with the statement that "Some parts of AI alignment research are a subset of AI research." As argued in detail in the paper, I feel that fields outside of AI research have a lot to contribute to AI alignment, especially when it comes to correctly shaping and managing the effects of actually deployed AI systems on society. Applied AI alignment is a true multi-disciplinary problem.
ON BRAVERY DEBATES IN ALIGNMENT
In the paper I mostly talk about what each field has to contribute in expertise, but I believe there is definitely also a bravery debate angle here, in the game of 4-dimensional policy discussion chess. I am going to use the bravery debate definition from here: I guess that a policy discussion devolving into a bravery debate is one of these 'many obstacles' which I mention above, one of the many general obstacles that stakeholders in policy discussions about AI alignment, global warming, etc. will need to overcome. From what I can see, as a European reading the Internet, the whole bravery debate anti-pattern seems to be very big right now in the US, and it has also popped up in discussions about ethical uses of AI. AI technologists have been cast as both cowardly defenders of the status quo, and as potentially brave nonconformists who just need to be woken up, or have already woken up. There is a very interesting paper which has a lot to say about this part of the narrative flow: Better, Nicer, Clearer, Fairer: A Critical Assessment of the Movement for Ethical Artificial Intelligence and Machine Learning [https://sch
Introduction To The Infra-Bayesianism Sequence

The meteor doesn't have to really flatten things out, there might be some actions that we think remain valuable (e.g. hedonism, saying tearful goodbyes).

And so if we have Knightian uncertainty about the meteor, maximin (as in Vanessa's link) means we'll spend a lot of time on tearful goodbyes.
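
To make the maximin point concrete, here is a small sketch under made-up numbers: two actions, two possible worlds, and full Knightian uncertainty over which world we are in. The utility values and the action/world names are purely illustrative assumptions.

```python
# Hypothetical utilities: utility[action][world].
# In the meteor world, gambling wins are worthless but goodbyes still have some value.
utility = {
    "gamble":   {"meteor": 0.0, "no_meteor": 10.0},
    "goodbyes": {"meteor": 1.0, "no_meteor": 2.0},
}

def worst_case(utility):
    """Worst-case utility of each action across all worlds."""
    return {action: min(by_world.values()) for action, by_world in utility.items()}

def maximin_choice(utility):
    """Action whose worst-case utility is highest."""
    wc = worst_case(utility)
    return max(wc, key=wc.get)

choice = maximin_choice(utility)   # "goodbyes": worst case 1.0 beats gamble's 0.0

# If the meteor really flattened everything, both actions would have the same
# worst case, and maximin would be indifferent between them.
flattened = {
    "gamble":   {"meteor": 0.0, "no_meteor": 10.0},
    "goodbyes": {"meteor": 0.0, "no_meteor": 2.0},
}
flat_worst = worst_case(flattened)
```

With any residual value left in the meteor world, pure maximin is driven entirely by that world, which is the worry being raised here.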

2 Diffractor 1mo: Said actions or lack thereof cause a fairly low utility differential compared to the actions in other, non-doomy hypotheses. Also I want to draw a critical distinction between "full Knightian uncertainty over meteor presence or absence", where your analysis is correct, and "ordinary probabilistic uncertainty between a high-Knightian-uncertainty hypothesis, and a low-Knightian-uncertainty one that says the meteor almost certainly won't happen" (where the meteor hypothesis will be ignored unless there's a meteor-inspired modification to what you do that's also very cheap in the "ordinary uncertainty" world, like calling your parents, because the meteor hypothesis is suppressed in decision-making by the low expected utility differentials, and we're maximin-ing expected utility).
Universality and the “Filter”

I am very confused. How is this better than just telling the human overseers "no, really, be conservative about implementing things that might go wrong"? What makes a two-part architecture seem appealing? What does "epistemic dominance" look like in concrete terms here - what are the observables you want to dominate HCH relative to, wouldn't this be very expensive, how does this translate to buying you extra safety, etc?

Introduction To The Infra-Bayesianism Sequence

What if you assumed the stuff you had the hypothesis about was independent of the stuff you have Knightian uncertainty about (until proven otherwise)?

E.g. if you're making hypotheses about a multi-armed bandit and the world also contains a meteor that might smash through your ceiling and kill you at any time, you might want to just say "okay, ignore the meteor, pretend my utility has a term for gambling wins that doesn't depend on the meteor at all."

The reason I want to consider stuff more like this is because I don't like having to evaluate my utility fun... (read more)

2 Diffractor 1mo: Something analogous to what you are suggesting occurs. Specifically, let's say you assign 95% probability to the bandit game behaving as normal, and 5% to "oh no, anything could happen, including the meteor". As it turns out, this behaves similarly to the ordinary bandit game being guaranteed, as the "maybe meteor" hypothesis assigns all your possible actions a score of "you're dead" so it drops out of consideration. The important aspect which a hypothesis needs, in order for you to ignore it, is that no matter what you do you get the same outcome, whether it be good or bad. A "meteor of bliss hits the earth and everything is awesome forever" hypothesis would also drop out of consideration because it doesn't really matter what you do in that scenario. To be a wee bit more mathy, probabilistic mix of inframeasures works like this. We've got a probability distribution $\zeta \in \Delta\mathbb{N}$, and a bunch of hypotheses $\psi_i \in \square X$, things that take functions as input, and return expectation values. So, your prior, your probabilistic mixture of hypotheses according to your probability distribution, would be the function $f \mapsto \sum_{i \in \mathbb{N}} \zeta(i) \cdot \psi_i(f)$. It gets very slightly more complicated when you're dealing with environments, instead of static probability distributions, but it's basically the same thing. And so, if you vary your actions/vary your choice of function $f$, and one of the hypotheses $\psi_i$ is assigning all these functions/choices of actions the same expectation value, then it can be ignored completely when you're trying to figure out the best function/choice of actions to plug in. So, hypotheses that are like "you're doomed no matter what you do" drop out of consideration, an infra-Bayes agent will always focus on the remaining hypotheses that say that what it does matters.
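
The "drops out of consideration" claim can be checked with a toy calculation. The sketch below is a heavy simplification and rests on assumptions not in the thread: each hypothesis is represented directly by the expectation value it assigns to each action (rather than as a full inframeasure), and the arm payoffs and hypothesis names are invented for illustration.

```python
# Hypothetical expectation values psi_i(action) for a two-armed bandit.
bandit = {"arm_a": 0.3, "arm_b": 0.7}    # what you do matters
doom   = {"arm_a": 0.0, "arm_b": 0.0}    # "you're dead" whatever you do
bliss  = {"arm_a": 100.0, "arm_b": 100.0}  # everything is awesome whatever you do

def mixture_value(weights, hypotheses, action):
    """Value of an action under the mixture: sum_i zeta(i) * psi_i(action)."""
    return sum(w * h[action] for w, h in zip(weights, hypotheses))

def best_action(weights, hypotheses):
    """Action maximizing the mixture value."""
    actions = hypotheses[0].keys()
    return max(actions, key=lambda a: mixture_value(weights, hypotheses, a))

best_alone      = best_action([1.0], [bandit])
best_with_doom  = best_action([0.95, 0.05], [bandit, doom])
best_with_bliss = best_action([0.95, 0.05], [bandit, bliss])

# A constant hypothesis shifts every action's value by the same amount,
# so the gap between actions only reflects the non-constant hypothesis.
gap_with_doom = (mixture_value([0.95, 0.05], [bandit, doom], "arm_b")
                 - mixture_value([0.95, 0.05], [bandit, doom], "arm_a"))
```

In all three cases the preferred arm is the same, and the gap between arms is just the bandit hypothesis's gap scaled by its weight of 0.95.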
Introduction To The Infra-Bayesianism Sequence

Of the agent foundations work from 2020, I think this sequence is my favorite, and I say this without actually understanding it.

The core idea is that Bayesianism is too hard. And so what we ultimately want is to replace probability distributions over all possible things with simple rules that don't have to put a probability on all possible things. In some ways this is the complement to logical uncertainty - logical uncertainty is about not having to have all possible probability distributions possible, this is about not having to put probability distributi... (read more)

1 DanielFilan 15h: I interviewed Vanessa here in an attempt to make this more digestible: it hopefully acts as context for the sequence, rather than a replacement for reading it.
Introduction To The Infra-Bayesianism Sequence

I'm confused about the Nirvana trick then. (Maybe here's not the best place, but oh well...) Shouldn't it break the instant you do anything with your Knightian uncertainty other than taking the worst-case?

2 Vanessa Kosoy 1mo: Notice that some non-worst-case decision rules are reducible to the worst-case decision rule.
2 Diffractor 1mo: Well, taking worst-case uncertainty is what infradistributions do. Did you have anything in mind that can be done with Knightian uncertainty besides taking the worst-case (or best-case)? And if you were dealing with best-case uncertainty instead, then the corresponding analogue would be assuming that you go to hell if you're mispredicted (and then, since best-case things happen to you, the predictor must accurately predict you).
The Solomonoff Prior is Malign

This was a really interesting post, and is part of a genre of similar posts about acausal interaction with consequentialists in simulatable universes.

The short argument is that if we (or not us, but someone like us with way more available compute) try to use the Kolmogorov complexity of some data to make a decision, our decision might get "hijacked" by simple programs that run for a very very long time and simulate aliens who look for universes where someone is trying to use the Solomonoff prior to make a decision and then based on what decision they want,... (read more)

Vanessa Kosoy's Shortform

if we assume our initial implementation to be flawed in a way that destroys alignment, why wouldn’t it also be flawed in a way that destroys corrigibility?

I think the people most interested in corrigibility are imagining a situation where we know what we're doing with corrigibility (e.g. we have some grab-bag of simple properties we want satisfied), but don't even know what we want from alignment, and then they imagine building an unaligned slightly-sub-human AGI and poking at it while we "figure out alignment."

Maybe this is a strawman, because the thin... (read more)

3 Vanessa Kosoy 1mo: The concept of corrigibility was introduced by MIRI, and I don't think that's their motivation? On my model of MIRI's model, we won't have time to poke at a slightly subhuman AI, we need to have at least a fairly good notion of what to do with a superhuman AI upfront. Maybe what you meant is "we won't know how to construct perfect-utopia-AI, so we will just construct a prevent-unaligned-AIs-AI and run it so that we can figure out perfect-utopia-AI at our leisure". Which, sure, but I don't see what it has to do with corrigibility. Corrigibility is neither necessary nor sufficient for safety. It's not strictly necessary because in theory an AI can resist modifications in some scenarios while always doing the right thing (although in practice resisting modifications is an enormous red flag), and it's not sufficient since an AI can be "corrigible" but cause catastrophic harm before someone notices and fixes it. What we're supposed to gain from corrigibility is having some margin of error around alignment, in which case we can decompose alignment as corrigibility + approximate alignment. But it is underspecified if we don't say along which dimensions or how big the margin is. If it's infinite margin along all dimensions then corrigibility and alignment are just isomorphic and there's no reason to talk about the former.
Summary of the Acausal Attack Issue for AIXI

I still feel like there's just too many pigeons and not enough holes.

Like, if you're an agent in some universe with complexity K(U) and you're located by a bridging rule with complexity K(B), you are not an agent with complexity K(U). Average case you have complexity (or really you think the world has some complexity) K(U)+K(B) minus some small constant. We can illustrate this fact by making U simple and B complicated - like locating a particular string within the digits of pi.

And if an adversary in a simple universe (complexity K(U')) "hijacks" you by ins... (read more)

Redwood's Technique-Focused Epistemic Strategy

In terms of how this strategy breaks, I think there's a lot of human guidance required to avoid either trying variations on the same not-quite-right ideas over and over, or trying a hundred different definitely-not-right ideas.

Given comfort and inertia, I expect the average research group to need impetus towards mixing things up. And they're smart people, so I'm looking forward to seeing what they do next.

Understanding Gradient Hacking

I think this is a totally fine length. But then, I would :P

I still feel like this was a little gears-light. Do the proposed examples of gradient hacking really work if you make a toy neural network with them? (Or does gradient descent find a way around the apparent local minimum?)

A Possible Resolution To Spurious Counterfactuals

Just leaving a sanity check, even though I'm not sure about what the people who were more involved at the time are thinking about the 5 and 10 problem these days:

Yes, I agree this works here. But it's basically CDT - this counterfactual is basically a causal do() operator. So it might not work for other problems that proof-based UDT was intended to solve in the first place, like the absent-minded driver, the non-anthropic problem, or simulated boxing.

1 Josh Hickman 2mo: It seems like you could use these counterfactuals to do whatever decision theory you'd like? My goal wasn't to solve actually hard decisions -- the 5 and 10 problem is perhaps the easiest decision I can imagine -- but merely to construct a formalism such that even extremely simple decisions involving self-proofs can be solved at all. I think the reason this seems to imply a decision theory is that it's such a simple model that there are some ways of making decisions that are impossible in the model -- a fair portion of that was inherited from the pseudocode in the Embedded Agency paper. I have an extension of the formalism in mind that allows an expression of UDT as well (I suspect. Or something very close to it. I haven't paid enough attention to the paper yet to know for sure). I would love to hear your thoughts once I get that post written up? :)
Introduction to inaccessible information

If the information won't fit into human ways of understanding the world, then we can't name what it is we're missing. This always makes examples of "inaccessible information" feel weird to me - like we've cheated by even naming the thing we want as if it's somewhere in the computer, and instead our first step should be to design a system that at all represents the thing we want.

Conversation on technology forecasting and gradualism

I think you could actually predict that nukes wouldn't destroy the planet in 1800 (or at least 1810), and that it would be large organizations rather than mad scientists who built them.

The reasoning for not destroying the earth is similar to the argument that the LHC won't destroy the earth. The LHC is probably fine because high energy cosmic rays hit us all the time and we're fine. Is this future bomb dangerous because it creates a chain reaction? Meteors hit us and volcanos erupt without creating chain reactions. Is this bomb super-dangerous because it c... (read more)

Theoretical Neuroscience For Alignment Theory

Part of what makes me skeptical of the logic "we have seen humans who we trust, so the same design space probably has decent density of superhumans who we'd trust" is that I'm not sold on the (effective) orthogonality thesis for human brains. Our cognitive limitations seem like they're an active ingredient in our conceptual/moral development. We might easily know how to get human-level brain-like AI to be trustworthy but never know how to get the same design with 10x the resources to be trustworthy.

3 Steve Byrnes 2mo: There are humans with a remarkable knack for coming up with new nanotech inventions. I don't think they (we?) have systematically different and worse motivations than normal humans. If they had even more of a remarkable knack—outside the range of humans—I don't immediately see what would go wrong. If you personally had more time to think and reflect, and more working memory and attention span, would you be concerned about your motivations becoming malign? (We might be having one of those silly arguments where you say "it might fail, we would need more research" and I say "it might succeed, we would need more research", and we're not actually disagreeing about anything.)
Theoretical Neuroscience For Alignment Theory

Nice post!

I think one of the things that average AI researchers are thinking about brains is that humans might not be very safe for other humans (link to Wei Dai post). I at least would pretty strongly disagree with "The brain is a totally aligned general intelligence."

I really like the thought about empathy as an important active ingredient in learning from other people's experience. It's very cool. It sort of implies an evo-psych story about the incentives for empathy that I'm not sure is true - what's the prevalence of empathy in social but non-general ... (read more)

4 Steve Byrnes 2mo: There's a weaker statement, "there exist humans who have wound up with basically the kinds of motivations that we would want an AGI to have". For example, Eliezer endorses a statement kinda like that here (and he names names—Carl Shulman & Paul Christiano). If you believe that weaker statement, it suggests that we're mucking around in a generally-promising space, but that we still have work to do. Note that motivations come from a combination of algorithms and "training data" / "life experience", both of which are going to be hard or impossible to match perfectly between humans and AGIs. The success story requires having enough understanding to reconstruct the important-for-our-purposes aspects.
Some thoughts on why adversarial training might be useful

What does wanting to use adversarial training say about the sorts of labeled and unlabeled data we have available to us? Are there cases where we could either do adversarial training, or alternately train on a different (potentially much more expensive) dataset?

What does it say about our loss function on different sorts of errors? Are there cases where we could either do adversarial training, or just use a training algorithm that uses a different loss function?

Biology-Inspired AGI Timelines: The Trick That Never Works

Yeah, the nonlinearity means it's hard to know what question to ask.

If we just eyeball the graph and say that the Elo is log(log(compute)) + time (I'm totally ignoring constants here), and we assume that compute = $e^t$ so that conveniently $\log(\text{compute}) = t$, then $\frac{d\,\text{Elo}}{dt} = \frac{1}{t} + 1$. The first term is from compute and the second from software. And so our history is totally not scale-free! There's some natural timescale set by $t = 1$, before which chess progress was dominated by compute and after which chess progress will be (was?) dominated by sof... (read more)
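
A quick numerical sketch of the eyeballed model, with constants ignored as in the comment. It assumes compute grows as e^t, so that log(compute) = t; that growth law and the function names are assumptions made for the illustration, not a fit to any data.

```python
import math

def elo(t):
    """Toy model: Elo = log(log(compute)) + time, with compute = e^t (assumed)."""
    compute = math.exp(t)
    return math.log(math.log(compute)) + t

def compute_rate(t):
    """Rate of Elo gain from the log(log(compute)) term: d/dt log(t) = 1/t."""
    return 1.0 / t

def software_rate(t):
    """Rate of Elo gain from the linear-in-time (software) term."""
    return 1.0

def numeric_rate(t, h=1e-6):
    """Central-difference check of the total rate d(Elo)/dt."""
    return (elo(t + h) - elo(t - h)) / (2 * h)

# Early on the compute term dominates, later the software term does;
# the two rates cross at t = 1 in these constant-free units.
early, late = 0.5, 4.0
early_compute_dominates = compute_rate(early) > software_rate(early)
late_software_dominates = software_rate(late) > compute_rate(late)
```

The crossover at t = 1 is an artifact of dropping all constants; with real constants the timescale would be whatever makes the two rates equal.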

Integrating Three Models of (Human) Cognition

Good points, thanks for the elaboration. I agree it could also be the case that integrating thoughts with different locations of origin only happens by broadcasting both separately and then only later synthesizing them with some third mechanism (is this something we can probe by having someone multitask in an fMRI and looking for rapid strobe-light alternations of [e.g.] "count to 10"-related and "do the hand jive"-related activations?).

In a modus ponens / modus tollens sort of way, such a non-synthesizing GNW would be less useful to understanding consciou... (read more)

Discussion: Objective Robustness and Inner Alignment Terminology

Just re-read this because you cited it recently, and I like it even more the second time :)

I also like an intermediate point between the changes you lay out: keeping the "old style" tree diagram that puts outer alignment and objective robustness together under "intent alignment," but changing the interpretation of these boxes to your "new style" version where outer alignment is less impressive / stringent and robustness is more central.

Integrating Three Models of (Human) Cognition

Hm, I'm not so sure about this take on GNW (I'm not saying you're inaccurate about the literature, I just feel disagreeable).

To illustrate my reservations: soon after I read the sentence about GNW meaning you can only be conscious of one thing at a time, as I was considering that proposition, I felt my chin was a little itchy and so I scratched it. So now I can remember thinking about the proposition while simultaneously scratching my chin. Trying to recall exactly what I was thinking at the time now also brings up a feeling of a specific body posture.

This... (read more)

1 Jack Koch 2mo: To me, "thinking about the proposition while simultaneously scratching my chin" sounds like a separate "thing" (complex representation formed in the GNW) than either "think about proposition" or "scratch my chin"... and you experienced this thing after the other ones, right? Like, from the way you described it, it sounds to me like there was actually 1) the proposition 2) the itch 3) the output of a 'summarizer' that effectively says "just now, I was considering this proposition and scratching my chin". [I guess, in this sense, I would say you are ordinarily doing some "weird self-deceptive dance" that prevents you from noticing this, because most people seem to ordinarily experience "themselves" as the locus of/basis for experience, instead of there being a stream of moments of consciousness, some of which apparently refer to an 'I'.] Also, I have this sense that you're chunking your experience into "things" based on what your metacognitive summarizer-of-mental-activity is outputting back to the workspace, but there are at least 10 representations streaming through the workspace each second, and many of these will be far more primitive than any of the "things" we've already mentioned here (or than would ordinarily be noticed by the summarizer without specific training for it, e.g. in meditation). Like, in your example, there were visual sensations from the reading, mental analyses about its content, the original raw sensation of the itch, the labeling of it as "itchy," the intention to scratch the itch, (definitely lots more...), and, eventually, the thought "I remember thinking about this proposition and scratching my chin 'at the same time'."
P₂B: Plan to P₂B Better

Saying that resource acquisition is in the service of improved planning (because it makes future plans better) seems like a bit of a stretch - you could just as easily say that improved planning is in the service of resource acquisition (because it lets you use resources you couldn't before). "But executing plans is how you get the goal!" you might say, and "But using your resources is how you get to the goal!" is the reply.

Maybe this is nitpicking, because I agree with you that there is some central thing going on here that is the same whatever you choose to call "more fundamental." Some essence of getting to the goal, even though the world is bigger than me. So I'm looking forward to where this is headed.

Biology-Inspired AGI Timelines: The Trick That Never Works

Which examples are you thinking of? Modern Stockfish outperformed historical chess engines even when using the same resources, until far enough in the past that computers didn't have enough RAM to load it.

I definitely agree with your original-comment points about the general informativeness of hardware, and absolutely software is adapting to fit our current hardware. But this can all be true even if advances in software can make more than 20 orders of magnitude difference in what hardware is needed for AGI, and are much less predictable than advances in hardware rather than being adaptations in lockstep with it.

6Paul Christiano2moHere are the graphs from Hippke (he or I should publish a summary at some point, sorry). I wanted to compare Fritz (which won WCCC in 1995) to a modern engine to understand the effects of hardware and software performance. I think the time controls for that tournament are similar to SF STC. I wanted to compare to SF8 rather than one of the NNUE engines to isolate out the effect of compute at development time and just look at test-time compute. So having modern algorithms would have let you win WCCC while spending about 50x less on compute than the winner. Having modern computer hardware would have let you win WCCC spending way more than 1000x less on compute than the winner. Measured this way software progress seems to be several times less important than hardware progress despite much faster scale-up of investment in software. But instead of asking "how well does hardware/software progress help you get to 1995 performance?" you could ask "how well does hardware/software progress get you to 2015 performance?" and on that metric it looks like software progress is way more important because you basically just can't scale old algorithms up to modern performance. The relevant measure varies depending on what you are asking. But from the perspective of takeoff speeds, it seems to me like one very salient takeaway is: if one chess project had literally come back in time with 20 years of chess progress, it would have allowed them to spend 50x less on compute than the leader. ETA: but note that the ratio would be much more extreme for Deep Blue, which is another reasonable analogy you might use.
Measuring hardware overhang

On the one hand this is an interesting and useful piece of data on AI scaling and the progress of algorithms. It's also important because it makes the point that the very notion of "progress of algorithms" implies hardware overhang as important as >10 years of Moore's law. I also enjoyed the follow-up work that this spawned in 2021.

Biology-Inspired AGI Timelines: The Trick That Never Works

This was super interesting. 

I don't think you can directly compare brain voltage to the Landauer limit, because brains operate chemically, so we also care about differences in chemical potential (e.g. of sodium vs potassium, which are importantly segregated across cell membranes even though both have the same charge). To really illustrate this, we might imagine information-processing biology that uses no electrical charges, only signalling via gradients of electrically-neutral chemicals. I think this raises the total potential relative to Landauer and cuts down the number of molecules we should estimate as transported per signal.

Biology-Inspired AGI Timelines: The Trick That Never Works

Wow, I'd forgotten about that prediction dataset! It seems like there's only even semi-decent data since 1994, but since then there does seem to be a plausible ~35-year median in the recorded points (even though, or perhaps because, the sampled distribution has been changing over time).

Biology-Inspired AGI Timelines: The Trick That Never Works

The chess link maybe should go to hippke's work. What you can see there is that a fixed chess algorithm turns exponentially growing compute into only linearly growing Elo, i.e. Elo grows logarithmically in compute. Similar behavior features in recent pessimistic predictions of deep learning's future trajectory.
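
To make the scaling claim concrete, here is a minimal sketch of a toy model in which Elo is linear in the logarithm of compute. The function names and the constants `base_elo` and `elo_per_doubling` are made up for illustration and are not taken from hippke's measurements.

```python
import math

def elo_from_compute(compute, base_elo=2000.0, elo_per_doubling=60.0):
    """Toy model (hypothetical constants): Elo is linear in log2(compute)."""
    return base_elo + elo_per_doubling * math.log2(compute)

def compute_needed(target_elo, base_elo=2000.0, elo_per_doubling=60.0):
    """Invert the toy model: compute required grows exponentially in target Elo."""
    return 2.0 ** ((target_elo - base_elo) / elo_per_doubling)

if __name__ == "__main__":
    # Each doubling of compute adds the same fixed number of Elo points.
    for doublings in range(0, 11, 2):
        print(doublings, elo_from_compute(2.0 ** doublings))
```

Under this assumption every extra fixed chunk of Elo costs a constant multiplicative factor in compute, which is the "logarithmic penalty" at issue.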

If general navigation of the real world suffers from this same logarithmic-or-worse penalty when translating hardware into performance metrics, then (perhaps surprisingly) we can't conclude that hardware is the dominant driver of progress by noticing that the ... (read more)

1CarlShulman2moBut new algorithms also don't work well on old hardware. That's evidence in favor of Paul's view that much software work is adapting to exploit new hardware scales.
My take on higher-order game theory

I have a question about this entirely divorced from practical considerations. Can we play silly ordinal games here?

If you assume that the other agent will take the infinite-order policy, but then naively maximize your expected value rather than unrolling the whole game-playing procedure, this is sort of like . So I guess my question is, if you take this kind of dumb agent (that still has to compute the infinite agent) as your baseline and then re-build an infinite tower of agents (playing other agents of the same level) on top of it, does it reconv... (read more)

1Nisan2moI think you're saying $A_{\omega+1} := [\Delta A_\omega \to \Delta A_0]$, right? In that case, since $A_0$ embeds into $A_\omega$, we'd have $A_{\omega+1}$ embedding into $A_\omega$. So not really a step up. If you want to play ordinal games, you could drop the requirement that agents are computable / Scott-continuous. Then you get the whole ordinal hierarchy. But then we aren't guaranteed equilibria in games between agents of the same order. I suppose you could have a hybrid approach: Order ω+1 is allowed to be discontinuous in its order-ω beliefs, but higher orders have to be continuous? Maybe that would get you to ω2.
Corrigibility Can Be VNM-Incoherent

So we have a switch with two positions, "R" and "L."

When the switch is "R," the agent is supposed to want to go to the right end of the hallway, and vice versa for "L" and left. It's not that you want this agent to be uncertain about the "correct" value of the switch and so it's learning more about the world as you send it signals - you just want the agent to want to go to the left when the switch is "L," and to the right when the switch is "R."

If you start with the agent going to the right along this hallway, and you change the switch to "L," and then a m... (read more)

Corrigibility Can Be VNM-Incoherent

I think instrumental convergence should still apply to some utility functions over policies, specifically the ones that seem to produce "smart" or "powerful" behavior from simple rules. But I don't know how to formalize this or if anyone else has.

2Alex Turner2moI share an intuition in this area, but "powerful" behavior tendencies seem nearly equivalent to instrumental convergence to me. It feels logically downstream of instrumental convergence. I already have a (somewhat weak) result on power-seeking wrt the simplicity prior over state-based reward functions. This isn't about utility functions over policies, though.
7tailcalled2moSince you can convert a utility function over states or observation-histories into a utility function over policies (well, as long as you have a model for measuring the utility of a policy), and since utility functions over states/observation-histories do satisfy instrumental convergence, yes you are correct. I feel like in a way, one could see the restriction to defining it in terms of e.g. states as a definition of "smart" behavior; if you define a reward in terms of states, then the policy must "smartly" generate those states, rather than just yield some sort of arbitrary behavior. 🤔 I wonder if this approach could generalize TurnTrout's approach. I'm not entirely sure how, but we might imagine that a structured utility function u(π) over policies could be decomposed into r(f(π)), where f is the features that the utility function pays attention to, and r is the utility function expressed in terms of those features. E.g. for state-based rewards, one might take f to be a model that yields the distribution of states visited by the policy, and r to be the reward function for the individual states (some sort of modification would have to be made to address the fact that f outputs a distribution but r takes in a single state... I guess this could be handled by working in the category of vector spaces and linear transformations but I'm not sure if that's the best approach in general - though since Set can be embedded into this category, it surely can't hurt too much). Then the power-seeking situation boils down to that the vast majority of policies π lead to essentially the same features f(π), but that there is a small set of power-seeking policies that lead to a vastly greater range of different features? And so for most r, a π that optimizes/satisfices/etc. r∘f will come from this small set of power-seeking policies. I'm not sure how to formalize this. I think it won't hold for generic vector spaces, since almost all linear transformations are invertible?
But it seems to me that in
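
A minimal sketch of the u(π) = r(f(π)) decomposition described in the comment above, on a made-up deterministic toy MDP. The state names, the `TRANSITIONS` table, the horizon, and the choice of visit frequencies as features are all assumptions for illustration. Here r is extended linearly from single states to visit distributions, which is one way to handle the distribution-vs-state mismatch the comment mentions.

```python
from itertools import product

# Hypothetical toy MDP: deterministic transitions, state -> {action: next_state}.
TRANSITIONS = {
    "start": {"left": "dead_end", "right": "hub"},
    "hub": {"a": "goal_a", "b": "goal_b", "c": "goal_c"},
    "dead_end": {"stay": "dead_end"},
    "goal_a": {"stay": "goal_a"},
    "goal_b": {"stay": "goal_b"},
    "goal_c": {"stay": "goal_c"},
}

def all_policies():
    """Enumerate every deterministic policy (one action per state)."""
    states = sorted(TRANSITIONS)
    for actions in product(*(sorted(TRANSITIONS[s]) for s in states)):
        yield dict(zip(states, actions))

def f(policy, horizon=10):
    """Features: fraction of time steps spent in each state, starting at 'start'."""
    counts = {s: 0 for s in TRANSITIONS}
    state = "start"
    for _ in range(horizon):
        counts[state] += 1
        state = TRANSITIONS[state][policy[state]]
    return {s: c / horizon for s, c in counts.items()}

def r(features, state_reward):
    """Linear extension of a per-state reward to a visit distribution."""
    return sum(features[s] * state_reward[s] for s in features)

def u(policy, state_reward):
    """Utility over policies, factored through the features: u = r . f."""
    return r(f(policy), state_reward)

if __name__ == "__main__":
    distinct = {tuple(sorted(f(p).items())) for p in all_policies()}
    print(len(list(all_policies())), "policies,", len(distinct), "distinct feature vectors")
```

In this toy, every policy that goes left collapses onto the same feature vector, while the policies that pass through the hub reach several different ones. It only illustrates the decomposition, not the general claim about most r.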
Corrigibility Can Be VNM-Incoherent

Someone at the coffee hour (Viktoriya? Apologies if I've forgotten a name) gave a short explanation of this using cycles. If you imagine an agent moving either to the left or the right along a hallway, you can change its utility function in a cycle such that it repeatedly ends up in the same place in the hallway with the same utility function.

This basically eliminates expected utility (as a discounted sum of utilities of states) maximization as producing this behavior. But you can still imagine selecting a policy such that it takes the right actions in res... (read more)

2Alex Turner2moI'm not parsing this. You change the utility function, but it ends up in the same place with the same utility function? Did we change it or not? (I think simply rewording it will communicate your point to me)
3tailcalled2mo🤔 I was about to say that I felt like my approach could still be done in terms of state rewards, and that it's just that my approach violates some of the technical assumptions in the OP. After all, you could just reward for being in a state such that the various counterfactuals apply when rolling out from this state; this would assign higher utility to the blue states than the red states, encouraging corrigibility, and contradicting TurnTrout's assumption that utility would be assigned solely based on the letter. But then I realized that this introduces a policy dependence to the reward function; the way you roll out from a state depends on which policy you have. (Well, in principle; in practice some MDPs may not have much dependence on it.) The special thing about state-based rewards is that you can assign utilities to trajectories without considering the policy that generates the trajectory at all. (Which to me seems bad for corrigibility, since corrigibility depends on the reasons for the trajectories, and not just the trajectories themselves.) But now consider the following: If you have the policy, you can figure out which actions were taken, just by applying the policy to the state/history. And instrumental convergence does not apply to utility functions over action-observation histories. So therefore it doesn't apply to utility functions over (policies, observation histories). (I think?? At least if the set of policies is closed under replacing an action under a specified condition, and there are no Newcombian issues that create non-causal dependencies between policies and observation histories?). So a lot of the instrumental convergence power comes from restricting the things you can consider in the utility function. u-AOH is clearly too broad, since it allows assigning utilities to arbitrary sequences of actions with identical effects, and simu
Goodhart: Endgame
  • One thing we can do to help is set up our AI to avoid taking us into weird out-of-distribution situations where my preferences are ill-defined.
  • Another thing we can do to help is have meta-preferences about how to deal with situations where my preferences are ill-defined, and have the AI learn those meta-preferences.

And in fact, "We don't actually want to go to the sort of extreme value that you can coax a model of us into outputting in weird out-of-distribution situations" is itself a meta-preference, and so we might expect something that does a good job o... (read more)

4Steve Byrnes2moThanks! I guess I was just thinking, sometimes every option is out-of-distribution, because the future is different than the past, especially when we want AGIs to invent new technologies etc. I agree that adversarially-chosen OOD hypotheticals are very problematic. I think Stuart Armstrong thinks the end goal has to be a utility function because utility-maximizers are in reflective equilibrium in a way that other systems aren't; he talks about that here.
Models Modeling Models

This was a whole 2 weeks ago, so all I can say for sure was that I was at least ambiguous about your point.

But I feel like I kind of gave a reply anyway - I don't think the parallel with subagents is very deep. But there's a very strong parallel (or maybe not even a parallel, maybe this is just the thing I'm talking about) with generative modeling.

Ngo and Yudkowsky on AI capability gains

Parts of this remind me of flaming my team in a cooperative game.

A key rule to remember about team chat in videogames is that chat actions are moves in the game. It might feel satisfying to verbally dunk on my teammate for a̶s̶k̶i̶n̶g̶ ̶b̶i̶a̶s̶e̶d̶ ̶̶q̶u̶e̶s̶t̶i̶o̶n̶s̶ not ganking my lane, and I definitely do it sometimes, but I do it less if I occasionally think "what chat actions can help me win the game from this state?"

This is less than maximally helpful advice in a conversation where you're not sure what "winning" looks like. And some of the more obvious implications might look like the dreaded social obeisance.

Ngo and Yudkowsky on alignment difficulty

Ngo is very patient and understanding.

Perhaps... too patient and understanding. Richard! Blink twice if you're being held against your will!


(I too would like you to write more about agency :P)

"Summarizing Books with Human Feedback" (recursive GPT-3)

Ah, yeah. I guess this connection makes perfect sense if we're imagining supervising black-box-esque AIs that are passing around natural language plans.

Although that supervision problem is more like... summarizing Alice in Wonderland if all the pages had gotten ripped out and put back in random order. Or something. But sure, baby steps.

"Summarizing Books with Human Feedback" (recursive GPT-3)

I'd heard about this before, but not the alignment spin on it. This is more interesting to me from a capabilities standpoint than an alignment standpoint, so I had assumed that this was motivated by the normal incentives for capabilities research. I'd be interested if I'm in fact wrong, or if it seems more alignment-y to other people.

From an alignment perspective the main point is that the required human data does not scale with the length of the book (or maybe scales logarithmically). In general we want evaluation procedures that scale gracefully, so that we can continue to apply them even for tasks where humans can't afford to produce or evaluate any training examples.
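
As a rough illustration of why the amount of human evaluation per step can stay bounded, here is a sketch of bottom-up recursive summarization. The `summarize_stub` placeholder, the `fan_in` parameter, and the chunking scheme are assumptions for illustration, not the paper's actual procedure. It assumes a non-empty list of chunks.

```python
def summarize_stub(text, max_words=20):
    """Placeholder for a learned summarizer: keeps the first max_words words."""
    return " ".join(text.split()[:max_words])

def recursive_summarize(chunks, fan_in=4, summarize=summarize_stub):
    """Summarize chunks bottom-up; returns (summary, number_of_levels).

    Every call to `summarize` sees a bounded amount of text, so a human
    checking any single step never has to read the whole book.
    """
    level = [summarize(c) for c in chunks]
    levels = 1
    while len(level) > 1:
        # Group fan_in neighbouring summaries and summarize each group.
        groups = [level[i:i + fan_in] for i in range(0, len(level), fan_in)]
        level = [summarize(" ".join(g)) for g in groups]
        levels += 1
    return level[0], levels

if __name__ == "__main__":
    for n in (4, 64, 256):
        _, depth = recursive_summarize(["some text"] * n)
        print(n, "chunks ->", depth, "levels")
```

In this sketch the number of levels grows with the logarithm of the number of chunks, while each individual summarization task stays the same size.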

The approach in this paper will produce worse summaries than fine-tuning a model end-to-end. In order to produce good summaries, you will ultimately need to use more sophisticated decompositions---for example, if a char... (read more)

What exactly is GPT-3's base objective?

Yeah, agreed. It's true that GPT obeys the objective "minimize the cross-entropy loss between the output and the distribution of continuations in the training data." But this doesn't mean it doesn't also obey objectives like "write coherent text", to the extent that we can tell a useful story about how the training set induces that behavior.
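
For concreteness, the cross-entropy objective mentioned above can be sketched for a single next-token prediction. The toy vocabulary and probabilities are made up, and the function name is hypothetical.

```python
import math

def cross_entropy(target_dist, predicted_dist):
    """H(p, q) = -sum_x p(x) * log q(x), in nats.

    target_dist: distribution of continuations in the data (token -> prob).
    predicted_dist: the model's predicted distribution (token -> prob).
    """
    return -sum(
        p * math.log(predicted_dist[tok])
        for tok, p in target_dist.items()
        if p > 0
    )

if __name__ == "__main__":
    data = {"cat": 0.5, "dog": 0.5}
    good_model = {"cat": 0.5, "dog": 0.5}
    bad_model = {"cat": 0.9, "dog": 0.1}
    # The loss is minimized when the model matches the data distribution.
    print(cross_entropy(data, good_model), cross_entropy(data, bad_model))
```

The loss is smallest when the predicted distribution equals the data distribution, which is all the base objective says. Whether the resulting behavior is usefully described as "writing coherent text" is a separate, higher-level story.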

(It is amusing to me how our thoughts immediately both jumped to our recent hobbyhorses.)

Models Modeling Models

My suggestion is that for practically everything you say in this post, you can say a closely-analogous thing where you throw out the word "model" and just say "the human has lots of preferences, and those preferences don't always agree with each other, especially OOD".

Yes, I'm fine with this rephrasing. But I wouldn't write a post using only the "the human has the preferences" way of speaking, because lots of different ways of thinking about the world use that same language.

This is basically a "subagent" perspective.

I think this post is pretty different... (read more)

2Steve Byrnes2moHmm. I think you missed my point… There are two different activities: ACTIVITY A: Think about how an AI will form a model of what a human wants and is trying to do. ACTIVITY B: Think about the gears underlying human intelligence and motivation. You're doing Activity A every day. I'm doing Activity B every day. My comment was trying to say: "The people like you, doing Activity A, may talk about there being multiple models which tend to agree in-distribution but not OOD. Meanwhile, the people like me, doing Activity B, may talk about subagents. There's a conceptual parallel between these two different discussions." And I think you thought I was saying: "We both agree that the real ultimate goal right now is Activity A. I'm leaving a comment that I think will help you engage in Activity A, because Activity A is the thing to do. And my comment is: (something about humans having subagents)." Does that help?
Competent Preferences

By "violate a preference," I mean that the preference doesn't get satisfied - so if the human competently prefers 2 bananas but only got 1 banana, their preference has been violated.

But maybe you mean something along the lines of "If competent preferences are really broadly predictive, then wouldn't it be even more predictive to infer the preference 'the human prefers 2 bananas except when the AI gives them 1', since that would more accurately predict how many bananas the human gets? This would sort of paint us into a corner where it's hard to violate com... (read more)
