All of Charlie Steiner's Comments + Replies

Conditioning Generative Models

I think this is only one horn of a dilemma.

The other horn is if the generative model reasons about the world abstractly, so that it just gives us a good guess about what the output of the AI would be if it really was in the real world (and got to see some large hash collision).

But now it seems likely that creating this generative model would require solving several tricky alignment problems so that it generalizes its abstractions to novel situations in ways we'd approve of.

Where I agree and disagree with Eliezer

Faster than gradient descent is not a selective pressure, at least if we're considering typical ML training procedures. What is a selective pressure is regularization, which functions much more like a complexity prior than a speed prior.

So (again sticking to modern day ML as an example, if you have something else in mind that would be interesting) of course there will be a cutoff in terms of speed, excluding all algorithms that don't fit into the neural net. But among algorithms that fit into the NN, the penalty on their speed will be entirely explainable ... (read more)

Where I agree and disagree with Eliezer

This seems like a good thing to keep in mind, but also sounds too pessimistic about the ability of gradient descent to find inference algorithms that update more efficiently than gradient descent.

2Richard Ngo7d
I do expect this to happen. The question is merely: what's the best predictor of how hard it is to find inference algorithms more effective than gradient descent? Is it whether those inference algorithms are more complex than gradient descent? Or is it whether those inference algorithms run for longer than gradient descent? Since gradient descent is very simple but takes a long time to run, my bet is the latter: there are many simple ways to convert compute to optimisation, but few compute-cheap ways to convert additional complexity to optimization.
Let's See You Write That Corrigibility Tag

Yeah, I already said most of the things that I have a nonstandard take on, without getting into the suitcase word nature of "corrigibility" or questioning whether researching it is worth the time. Just fill in the rest with the obvious things everyone else says.

Let's See You Write That Corrigibility Tag

Disclaimer: I am not writing my full opinions. I am writing this as if I was an alien writing an encyclopedia entry on something they know is a good idea. These aliens may define "corrigibility" and its sub-categories slightly differently than earthlings. Also, I am bad at giving things catchy names, so I've decided that whenever I need a name for something I don't know the name of, I will make something up and accept that it sounds stupid. 45 minutes go. (EDIT: Okay, partway done and having a reasonably good time. Second 45 minutes go!) (EDIT2: Ok, we... (read more)

RL with KL penalties is better seen as Bayesian inference

You mention converging to a deterministic policy is bad because of repetition, but did I miss you addressing that it's also bad because we want diversity? (Edit: now that I reread that sentence, it makes no sense. Sorry!) In some sense we don't want RL in the limit, we want something a little more aware that we want to sample from a distribution and get lots of different continuations that are all pretty good.
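A toy illustration of the diversity point, as a minimal sketch rather than anything from the post itself. It assumes a discrete set of four continuations, a made-up base distribution `pi0`, made-up rewards, and a KL coefficient `beta` (all hypothetical names and numbers). The distribution that maximizes expected reward minus `beta` times KL(pi || pi0) is proportional to pi0 * exp(reward / beta), so a moderate `beta` keeps several good continuations in play, while a tiny `beta` approaches the deterministic "RL in the limit" policy.

```python
import math

def kl_regularized_policy(pi0, rewards, beta):
    """Return the distribution maximizing E[r] - beta * KL(pi || pi0)."""
    # Subtract the max reward before exponentiating for numerical stability.
    r_max = max(rewards)
    weights = [p * math.exp((r - r_max) / beta) for p, r in zip(pi0, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(dist):
    """Shannon entropy in nats; higher means more diverse samples."""
    return -sum(p * math.log(p) for p in dist if p > 0)

# Hypothetical base model and rewards over four continuations.
pi0 = [0.4, 0.3, 0.2, 0.1]
rewards = [1.0, 0.9, 0.8, 0.0]

moderate = kl_regularized_policy(pi0, rewards, beta=1.0)   # still spread out
sharp = kl_regularized_policy(pi0, rewards, beta=0.01)     # nearly deterministic
```

Comparing `entropy(moderate)` with `entropy(sharp)` shows how much diversity is lost as the penalty shrinks.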

[Intro to brain-like-AGI safety] 14. Controlled AGI

If I wanted to play fast and loose, I would claim that our sense of ourselves as having a first-person perspective at all is part of an evolutionary solution to the problem of learning from other people's experiences (wait, wasn't there a post like that recently? Or was that about empathy...). It merely seems like a black box to us because we're too good at it, precisely because it's so important.

Somehow we develop a high-level model of the world with ourselves and other people in it, and then this level of abstraction actually gets hooked up to our motivations - mak... (read more)

Open Problems in Negative Side Effect Minimization

There's definitely a tension here between avoiding bad disruptive actions and doing good disruptive actions.

It seems to me like you're thinking about SEM more like a prior that starts out dominant but can get learned away over time. Is that somewhat close to how you're thinking about this tension?

1Fabian Schimpf2mo
Starting more restrictive seems sensible; this could be, as you say, learned away, or one could use human feedback to sign off on high-impact actions. The first problem reminds me of finding regions of attraction in nonlinear control where the ROA is explored without leaving the stable region. The second approach seems to hinge on humans being able to understand the implications of high-impact actions and the consequences of a baseline like inaction. There are probably also other alternatives that we have not yet considered.
Law-Following AI 3: Lawless AI Agents Undermine Stabilizing Agreements

Decision-makers who need inspections to keep them in line are incentivized to subvert those inspections and betray the other party. It seems like what's actually necessary is people in key positions who are willing to cooperate in the prisoner's dilemma when they think they are playing against people like them - people who would cooperate even if there were no inspections.

But if there are inspections, then why do we need law-following AI? Why not have the inspections directly check that the AI would not harm the other party (hopefully because it would be helping humans in general)?

Refine: An Incubator for Conceptual Alignment Research Bets

Great news! I have to change the post I was drafting about unfilled niches :)

2Adam Shimi2mo
Sorry to make you work more, but happy to fill a much needed niche. ^^
AMA Conjecture, A New Alignment Startup

Do you expect interpretability tools developed now to extend to interpreting more general (more multimodal, better at navigating the real world) decision-making systems? How?

3Connor Leahy3mo
Yes, we do expect this to be the case. Unfortunately, I think explaining in detail why we think this may be infohazardous. Or at least, I am sufficiently unsure about how infohazardous it is that I would first like to think about it for longer and run it through our internal infohazard review before sharing more. Sorry!
AXRP Episode 14 - Infra-Bayesian Physicalism with Vanessa Kosoy

Warning: rambling.

I feel like I haven't quite clarified for myself what the cybernetic agent is doing wrong (or if you don't want to assume physicalism, just put scare quotes around "doing wrong") when it sees the all-black input. There might be simple differences that have many implications.

Suppose that all hypotheses are in terms of physical universe + bridging law. We might accuse the cybernetic agent of violating a dogma of information flow by making its physical-universe-hypothesis depend on the output of the bridging law. But it doesn't necessarily h... (read more)

Why Agent Foundations? An Overly Abstract Explanation

It's not clear to me that your metaphors are pointing at something in particular.

Revenue of a nail factory is a good proxy for the quality of the nails produced, but only within a fairly small bubble around our current world. You can't make the factory-owner too smart, or the economy too irrational, or allow for too many technological breakthroughs to happen, or else the proxy breaks. If this was all we needed, then yes, absolutely, I'm sure there's a similarly neat and simple way to instrumentalize human values - it's just going to fail if things are too ... (read more)

Revenue of a nail factory is a good proxy for the quality of the nails produced, but only within a fairly small bubble around our current world. You can't make the factory-owner too smart, or the economy too irrational, or allow for too many technological breakthroughs to happen, or else the proxy breaks.

I think you missed the point of that particular metaphor. The claim was not that revenue of a nail factory is a robust operationalization of nail value. The claim was that a competitive nail market plus nail-maker reputation tracking is a True Name for a p... (read more)

When people ask for your P(doom), do you give them your inside view or your betting odds?

Could you explain more about the difference, and what it looks like to give one vs. the other?

1Morten Hustveit3mo
When betting, you should discount the scenarios where you're unable to enjoy the reward to zero. In less accurate terms, any doom scenario that involves you personally dying should be treated as impossible, because the expected utility of winning is zero.
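The arithmetic behind this can be sketched as follows. This is a hypothetical illustration, not Morten's own calculation: `bet_on_doom_value`, its parameters, and the numbers are all made up. The bettor wins `payout` if doom happens and loses `stake` otherwise, and `enjoy_if_doom` is the fraction of the payout that is still worth anything to them after doom.

```python
def bet_on_doom_value(p_doom, payout, stake, enjoy_if_doom=0.0):
    """Expected utility of betting on doom.

    Wins `payout` with probability p_doom, scaled by how much of it can
    actually be enjoyed; loses `stake` with probability 1 - p_doom.
    """
    return p_doom * payout * enjoy_if_doom - (1.0 - p_doom) * stake

# Inside view of 50% doom, offered generous 100:1 odds.
unenjoyable = bet_on_doom_value(0.5, payout=100.0, stake=1.0)
enjoyable = bet_on_doom_value(0.5, payout=100.0, stake=1.0, enjoy_if_doom=1.0)
```

With the winnings worthless after doom, the bet has negative value at any odds, so betting behavior reveals nothing about the inside-view probability.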
Why Agent Foundations? An Overly Abstract Explanation

Not sure if I disagree or if we're placing emphasis differently.

I certainly agree that there are going to be places where we'll need to use nice, clean concepts that are known to generalize. But I don't think that the resolutions to problems 1 and 2 will look like nice clean concepts (like in minimizing mutual information). It's not just human values that are messy and contingent, even the pointer we want to use to gesture to those-things-we-want-to-treat-as-our-values is messy and contingent. I think of some of my intuitions as my "real values" and others... (read more)

It's not just human values that are messy and contingent, even the pointer we want to use to gesture to those-things-we-want-to-treat-as-our-values is messy and contingent.

What's the evidence for this claim?

When I look at e.g. nails, the economic value of a nail seems reasonably complicated. Yet the "pointers to nail value" which we use in practice - i.e. competitive markets and reputation systems - do have clean, robust mathematical formulations.

Furthermore, before the mid-20th century, I expect that most people would have expected that competitive market... (read more)

ELK Thought Dump

Pragmatism's a great word, everyone wants to use it :P But to be specific, I mean more like Rorty (after some Yudkowskian fixes) than Peirce.

3Abram Demski3mo
Fair enough!
A Longlist of Theories of Impact for Interpretability

Fwiw, I do have the reverse view, but my reason is more that "auditing a trained model" does not have a great story for wins. Like, either you find that the model is fine (in which case it would have been fine if you skipped the auditing) or you find that the model will kill you (in which case you don't deploy your AI system, and someone else destroys the world instead).

The way I'd put something-like-this is that in order for auditing the model to help (directly), you have to actually be pretty confident in your ability to understand and fix your mistakes ... (read more)

A Longlist of Theories of Impact for Interpretability

Fun exercise.

I made a publicly editable google sheet with my own answers already added here (though I wrote down my answers in a text document, without more than glancing at previous answers):

Looks like I'm much more interested in interpretability as a cooperation / trust-building mechanism.

3Neel Nanda4mo
Good idea, thanks! I made a publicly editable spreadsheet for people to add their own
ELK prize results

Bravo! Honestly the thing I'm most impressed with here is your blazing speed.

I like the "make it useful to another AI" idea, in part because I think it has interesting failure modes. The dynamic between the predictor and the user is apparently adversarial (so you might imagine that training the predictor on a fixed user will lead to the user getting deceived, while training the user on a fixed predictor leads to deceptions being uncovered). But also, there's a cooperative dynamic where given a fixed evaluation function for how well the user does, both the predictor and the user are trying to find exploits in the evaluator.

ELK Thought Dump

I think arguing against Platonism is a good segue into arguing for pragmatism. We often use the word "knowledge" in different ways in different contexts, and I think that's fine.

When the context is about math we can "know" statements that are theorems of some axioms (given either explicitly or implicitly), but we can also use "know" in other ways, as in "we know P!=NP but we can't prove it."

And when the context is about the world, we can have "knowledge" that's about correspondence between our beliefs and reality. We can even use "knowledge" in a way that let... (read more)

3Abram Demski3mo
To be pedantic, "pragmatism" in the context of theories of knowledge means "knowledge is whatever the scientific community eventually agrees on" (or something along those lines -- I have not read deeply on it). [A pragmatist approach to ELK would, then, rule out "the predictor's knowledge goes beyond human science" type counterexamples on principle.] What you're arguing for is more commonly called contextualism. (The standards for "knowledge" depend on context.) I totally agree with contextualism as a description of linguistic practice, but I think the ELK-relevant question is: what notion of knowledge is relevant to reducing AI risk? (TBC, I don't think the answer to this is immediately obvious; I'm unsure which types of knowledge are most risk-relevant.)
ELK Thought Dump

I like this definition too. You might add some sort of distribution over goals (sort of like Attainable Utility) so that e.g. Alice can know things about things that she doesn't personally care about.

Alignment research exercises

This post, by example, seems like a really good argument that we should spend a little more effort on didactic posts of this sort. E.g. rather than just saying "physical systems have multiple possible interpretations," we could point people to a post about a gridworld with a deterministic system playing the role of the agent, such that there are a couple different pretty-good ways of describing this agent that mostly agree but generalize in different ways.

This perspective might also be a steelmanning of that sort of paper where there's an abstract argument... (read more)

REPL's and ELK

I can't help but feel that this sneakily avoids some of the hard parts of the problem by assuming that we know how to find certain things like "the state according to the AI/human" from the start.

ELK Proposal: Thinking Via A Human Imitator

I guess I'm still missing some of the positive story. Why should the AI optimally using the encoder-human-decoder part of the graph to do computation involve explaining clearly to the human what it thinks happens in the future?

Why wouldn't it do things like showing us a few different video clips and asking which looks most realistic to us, without giving any information about what it "really thinks" about the diamond? Or even worse, why wouldn't gradient descent learn to just treat the fixed human-simulator as a "reservoir" that it can use in unintended ways?

2Alex Turner4mo
I do think this is what happens given the current architecture. I argued that the desired outcome solves narrow ELK as a sanity check, but I'm not claiming that the desired setup is uniquely loss-minimizing. Part of my original intuition was "the human net is set up to be useful at predicting given honest information about the situation", and "pressure for simpler reporters will force some kind of honesty, but I don't know how far that goes." As time passed I became more and more aware of how this wasn't the best/only way for the human net to help with prediction, and turned more towards a search for a crisp counterexample.
ELK Proposal: Thinking Via A Human Imitator

I think the proposal is not to force the AI to funnel all its state through the human, but instead to make the human-simulation available as extra computing resources that the most-effective AIs should take advantage of.

The Big Picture Of Alignment (Talk Part 1)

Thanks a bunch!

  1. I want to interrogate a little more the notion that gradient descent samples uniformly (or rather, is dominated by the initialization distribution) from good parameters. Have you read various things about grokking like Hypothesis: GD Prefers General Circuits? That argument seems to be that you might start with parameters dominated by the initialization distribution, but various sorts of regularization are going to push you to sample solutions in a nonuniform way. Do you have a take on this?
  2. For the power-seeking-because-of-entropy example,
... (read more)
Robert Miles's Shortform

Ah, the good ol' Alien Concepts problem. Another interesting place this motif comes up is in defining logical counterfactuals - you'd think that logical inductors would have let us define logical counterfactuals, but it turns out that what we want from logical counterfactuals is basically just to use them in planning, which requires taking into account what we want.

QNR prospects are important for AI alignment research

The main argument I see not to take this too literally is something like "conservation of computation."

I think it's quite likely that left to its own devices a black-box machine learning agent would learn to represent information in weakly hierarchical models that don't have a simple graph structure (but maybe have some more complicated graph structure where representations are connected by complicated transition functions that also have side-effects on the state of the agent). If we actually knew how to access these representations, that would already be ... (read more)

5Eric Drexler5mo
Although I don’t understand what you mean by “conservation of computation”, the distribution of computation, information sources, learning, and representation capacity is important in shaping how and where knowledge is represented. The idea that general AI capabilities can best be implemented or modeled as “an agent” (an “it” that uses “the search algorithm”) is, I think, both traditional and misguided. A host of tasks require agentic action-in-the-world, but those tasks are diverse and will be performed and learned in parallel (see the CAIS report). Skill in driving somewhat overlaps with — yet greatly differs from — skill in housecleaning or factory management; learning any of these does not provide deep, state-of-the-art knowledge of quantum physics, and can benefit from (but is not a good way to learn) conversational skills that draw on broad human knowledge. A well-developed QNR store should be thought of as a body of knowledge that potentially approximates the whole of human and AI-learned knowledge, as well as representations of rules/programs/skills/planning strategies for a host of tasks. The architecture of multi-agent systems can provide individual agents with resources that are sufficient for the tasks they perform, but not orders of magnitude more than necessary, shaping how and where knowledge is represented. Difficult problems can be delegated to low-latency AI cloud services. There is no “it” in this story, and classic, unitary AI agents don’t seem competitive as service providers — which is to say, don’t seem useful. I’ve noted the value of potentially opaque neural representations (Transformers, convnets, etc.) in agents that must act skillfully, converse fluently, and so on, but operationalized, localized, task-relevant knowledge and skills complement rather than replace knowledge that is accessible by associative memory over a large, shared store.
Sharing Powerful AI Models

One issue is that good research tools are hard to build, and organizations may be reluctant to share them (especially since making good research tools public-facing is even more effort). Like, can I go out and buy a subscription to Anthropic's interpretability tools right now? That seems to be the future Toby (whose name, might I add, is highly confusable with Justin Shovelain's) is pushing for.

1Alex Gray5mo
It does seem that public/shared investment into tools that make structured access programs easier, might make more of them happen. As boring as it is, this might be a good candidate for technical standards for interoperability/etc.
Truthful LMs as a warm-up for aligned AGI

Sure - another way of phrasing what I'm saying is that I'm not super interested (as alignment research, at least) in adversarial training that involves looking at difficult subsets of the training distribution, or adversarial training where the proposed solution is to give the AI more labeled examples that effectively extend the training distribution to include the difficult cases.

It would be bad if we built an AI that wasn't robust on the training distribution, of course, but I think of this as a problem already being addressed by the field of ML without any need for looking ahead to AGI.

Truthful LMs as a warm-up for aligned AGI

I think there are different kinds of robustness, and people focused on present-day applications (including tests that are easy to do in the present) are going to focus on the kinds of robustness that help with present-day problems. Being robust to malicious input from human teenagers will only marginally help make you robust to input from a future with lots of extra technological progress. They might have very different-looking solutions, because of factors like interpolation vs. extrapolation.

Framing it this way suggests one concrete thing I might hope fo... (read more)

1Jacob Hilton5mo
I think this is included in what I intended by "adversarial training": we'd try to find tasks that cause the model to produce negligent falsehoods, train the model to perform better at those tasks, and aim for a model that is robust to someone searching for such tasks.
Truthful LMs as a warm-up for aligned AGI

Here's my worry.

If we adopt a little bit of deltonian pessimism (though not the whole hog), and model present-day language models as doing something vaguely like nearest-neighbor interpolation in a slightly sophisticated latent space (while still being very impressive), then we might predict that there are going to be some ways of getting honest answers an impressive percentage of the time while staying entirely within the interpolation regime.

And then if you look at the extrapolation regime, it's basically the entire alignment problem squeezed into a smal... (read more)

3Jacob Hilton5mo
I can think of a few different interpretations of your concern (and am interested to hear if these don't cover it):
  • There will be insufficient attention paid to robustness.
  • There will be insufficient attention paid to going beyond naive human supervision.
  • The results of the research will be misinterpreted as representing more progress than is warranted.
I agree that all of these are possibilities, and that the value of the endeavor could well depend on whether the people conducting (and communicating) the research are able to avoid pitfalls such as these. There's certainly more object-level discussion to be had about how much emphasis should be placed on avoiding these particular pitfalls, and I'm happy to dig in to them further if you're able to clarify which if any of them capture your main concern.
Different way classifiers can be diverse

It seems like there must be some decent ways to see how different two classifiers are, but I can only think of unprincipled things.

Two ideas:

Sample a lot of items and use both models to generate two rankings of the items (or log odds or some other score). Models that give similar scores to lots of examples are probably pretty similar. One problem with this is that optimizing for it when the problem is too easy will train your model to solve the problem a random way and then invert the ordering within the classes. (A similar solution with a similar problem ... (read more)
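The ranking comparison in the first idea could look something like the sketch below. It uses Spearman rank correlation as one possible similarity measure, which is a choice made here for illustration and not something the comment specifies. The score lists are made up, and the formula assumes no tied scores.

```python
def ranks(scores):
    """Rank position of each item (0 = lowest score); ties broken by index."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result

def spearman(scores_a, scores_b):
    """Spearman rank correlation via the no-ties formula 1 - 6*sum(d^2)/(n(n^2-1))."""
    n = len(scores_a)
    ra, rb = ranks(scores_a), ranks(scores_b)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Hypothetical scores from two classifiers on the same four sampled items.
model_a = [0.1, 0.4, 0.35, 0.8]
model_b = [0.2, 0.5, 0.3, 0.9]

same_order = spearman(model_a, model_b)
inverted = spearman(model_a, [-s for s in model_a])
```

A value near 1 means the two models order the items almost identically, and a value near -1 means one model's ordering is the reverse of the other's.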

ARC's first technical report: Eliciting Latent Knowledge

When you say "some case in which a human might make different judgments, but where it's catastrophic for the AI not to make the correct judgment," what I hear is "some case where humans would sometimes make catastrophic judgments."

I think such cases exist and are a problem for the premise that some humans agreeing means an idea meets some standard of quality. Bumbling into such cases naturally might not be a dealbreaker, but there are some reasons we might get optimization pressure pushing plans proposed by an AI towards the limits of human judgment.

ARC's first technical report: Eliciting Latent Knowledge

I wrote some thoughts that look like they won't get posted anywhere else, so I'm just going to paste them here with minimal editing:

  • They (ARC) seem to imagine that for all the cases that matter, there's some ground-truth-of-goodness judgment the human would make if they knew the facts (in a fairly objective way that can be measured by how well the human does at predicting things), and so our central challenge is to figure out how to tell the human the facts (or predict what the human would say if they knew all the facts).
  • In contrast, I don't think there's
... (read more)
2Paul Christiano6mo
Generally we are asking for an AI that doesn't give an unambiguously bad answer, and if there's any way of revealing the facts where we think a human would (defensibly) agree with the AI, then probably the answer isn't unambiguously bad and we're fine if the AI gives it. There are lots of possible concerns with that perspective; probably the easiest way to engage with them is to consider some concrete case in which a human might make different judgments, but where it's catastrophic for our AI not to make the "correct" judgment. I'm not sure what kind of example you have in mind and I have somewhat different responses to different kinds of examples. For example, note that ELK is never trying to answer any questions of the form "how good is this outcome?"; I certainly agree that there can also be ambiguity about questions like "did the diamond stay in the room?" but it's a fairly different situation. The most relevant sections are narrow elicitation and why it might be sufficient which gives a lot of examples of where we think we can/can't tolerate ambiguity, and to a lesser extent avoiding subtle manipulation which explains how you might get a good outcome despite tolerating such ambiguity. That said, there are still lots of reasonable objections to both of those.
1Ramana Kumar6mo
I think the problem you're getting at here is real -- path-dependency of what a human believes on how they came to believe it, keeping everything else fixed (e.g., what the beliefs refer to) -- but I also think ARC's ELK problem is not claiming this isn't a real problem but rather bracketing (deferring) it for as long as possible. Because there are cases where ELK fails that don't have much path-dependency in them, and we can focus on solving those cases until whatever else is causing the problem goes away (and only path-dependency is left).
ARC's first technical report: Eliciting Latent Knowledge

It might be useful to think of this as an empirical claim about diamonds.

I think this statement encapsulates some worries I have.

If it's important how the human defines a property like "the same diamond," then assuming that the sameness of the diamond is "out there in the diamond" will get you into trouble - e.g. if there's any optimization pressure to find cases where the specifics of the human's model rear their head. Human judgment is laden with the details of how humans model the world, you can't avoid dependence on the human (and the messiness that en... (read more)

Demanding and Designing Aligned Cognitive Architectures

This isn't about "inner alignment" (as catchy as the name might be), it's just about regular old alignment.

But I think you're right. As long as the learning step "erases" the model editing in a sensible way, then I was wrong and there won't be an incentive for the learned model to compensate for the editing process.

So if you can identify a "customer gender detector" node in some complicated learned world-model, you can artificially set it to a middle value as a legitimate way to make the RL agent less sensitive to customer gender.

I'm not sure how well this... (read more)

Demanding and Designing Aligned Cognitive Architectures

The facepalm was just because if this is really all inside the same RL architecture (no extra lines leading from the world-model to an unsupervised predictive reward), then all that will happen is the learned model will compensate for the distortions.

1Koen Holtman6mo
Not entirely sure what you mean with your aside on 'unsupervised predictive reward'. Is this a reference to unsupervised reinforcement learning? To a human supervisor controlling a live reward signal? But on your observation that 'the learned model will compensate for distortions': this sounds familiar. Here is a discussion.
INTUITION PUMPS AND INNER ALIGNMENT FAILURE
It is common for people on this forum to use a teleological intuition pump which makes them fear that such compensation for distortions must somehow always happen, or is very likely to happen, as an emergent phenomenon in any advanced RL or AGI system. Inner alignment failure has become a popular term on this forum when talking about this fear. But this teleological intuition pump that 'life reward maximization, uh, finds a way' is deeply incorrect in the general case. Especially for the case of agents which are designed not to be pure reward maximizers. I have more discussion about this in the paper, where I show how you can approach figure 4 with very different intuition pumps about using incorrect maps of territories. These intuition pumps will tell you much more clearly how and why this works. For the mathematically inclined, I also include references to hard mathematical work, which should of course take precedence over intuition.
STEEL-MANNING AN INNER ALIGNMENT FAILURE ARGUMENT
That being said, I'll now provide some failure mode analysis to show that 'the learned model will compensate for distortions' could conceivably happen, if no care is taken at all to prevent it. There are two world models in the cognitive architecture of figure 4, a blue one and a green one. The green one drives the agent's decisions. The goal of the architecture is to ensure that this green world model driving the reward-optimizing decisions is specifically incorrect. To counteract that goal, via the failure mode of 'the learned model will compensate for distortions', we have to imagine the following. W
Demanding and Designing Aligned Cognitive Architectures

I like this post, and I'm really happy to be kept apprised of what you're up to. But you can probably guess why I facepalmed when I saw the diagram with the green boxes :P

I'm not sure who's saying that AI alignment is "part of modern ML research." So I don't know if it's productive to argue for or against that. But there are definitely lots of people saying that AI alignment is part of the field of AI, and it sounds like you're disagreeing with that as well - is that right? How much would you say that this categorization is a bravery debate about what people need to hear / focus on?

3Koen Holtman6mo
Thanks! I can think of several reasons why different people on this forum might facepalm when seeing the diagram with the green boxes. Not sure if I can correctly guess yours. Feel free to expand. Yes I am disagreeing, of sorts. I would disagree with the statement that "AI alignment research is a subset of AI research" but I agree with the statement that "Some parts of AI alignment research are a subset of AI research." As argued in detail in the paper, I feel that fields outside of AI research have a lot to contribute to AI alignment, especially when it comes to correctly shaping and managing the effects of actually deployed AI systems on society. Applied AI alignment is a true multi-disciplinary problem.
ON BRAVERY DEBATES IN ALIGNMENT
In the paper I mostly talk about what each field has to contribute in expertise, but I believe there is definitely also a bravery debate angle here, in the game of 4-dimensional policy discussion chess. I am going to use the bravery debate definition from here: I guess that a policy discussion devolving into a bravery debate is one of these 'many obstacles' which I mention above, one of the many general obstacles that stakeholders in policy discussions about AI alignment, global warming, etc will need to overcome. From what I can see, as a European reading the Internet, the whole bravery debate anti-pattern seems to be very big right now in the US, and it has also popped up in discussions about ethical uses of AI. AI technologists have been cast as both cowardly defenders of the status quo, and as potentially brave nonconformists who just need to be woken up, or have already woken up. There is a very interesting paper which has a lot to say about this part of the narrative flow: Better, Nicer, Clearer, Fairer: A Critical Assessment of the Movement for Ethical Artificial Intelligence and Machine Learning [https://sch
Introduction To The Infra-Bayesianism Sequence

The meteor doesn't have to really flatten things out, there might be some actions that we think remain valuable (e.g. hedonism, saying tearful goodbyes).

And so if we have Knightian uncertainty about the meteor, maximin (as in Vanessa's link) means we'll spend a lot of time on tearful goodbyes.

Said actions, or the lack thereof, cause a fairly low utility differential compared to the actions in other, non-doomy hypotheses. Also, I want to draw a critical distinction between "full Knightian uncertainty over meteor presence or absence", where your analysis is correct, and "ordinary probabilistic uncertainty between a high-Knightian-uncertainty hypothesis and a low-Knightian-uncertainty one that says the meteor almost certainly won't happen". In the latter case the meteor hypothesis will be ignored unless there's a meteor-inspired modification to what you do that's also very cheap in the "ordinary uncertainty" world, like calling your parents, because the meteor hypothesis is suppressed in decision-making by the low expected utility differentials, and we're maximin-ing expected utility.
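As a rough illustration of that distinction, here is a minimal Python sketch. Everything in it is an assumption made for the example: the two actions, the utility numbers, the 5%/95% weights, and the simplification that a hypothesis is just a finite set of probability distributions whose value for an action is the worst-case expected utility. It is not the formalism from the sequence, only a toy version of the maximin reasoning.

```python
# Hypothetical utilities, UTILITY[action][world]; the numbers are made up.
UTILITY = {
    "work_as_usual": {"no_meteor": 10.0, "meteor": 0.0},
    "tearful_goodbyes": {"no_meteor": 2.0, "meteor": 1.0},
}
WORLDS = ["no_meteor", "meteor"]


def worst_case(action, distributions):
    """Maximin value: the minimum expected utility over a set of distributions."""
    return min(
        sum(p[w] * UTILITY[action][w] for w in WORLDS) for p in distributions
    )


# Full Knightian uncertainty about the meteor: either extreme is allowed.
KNIGHTIAN = [
    {"no_meteor": 1.0, "meteor": 0.0},
    {"no_meteor": 0.0, "meteor": 1.0},
]
# A low-uncertainty hypothesis: the meteor almost certainly won't happen.
CONFIDENT = [{"no_meteor": 0.99, "meteor": 0.01}]


def mixed_value(action, weighted_hypotheses):
    """Ordinary probabilistic mixture of the worst-case values of each hypothesis."""
    return sum(
        weight * worst_case(action, dists) for weight, dists in weighted_hypotheses
    )


def choose(value_fn):
    """Pick the action with the highest value under value_fn."""
    return max(UTILITY, key=value_fn)


full_knightian_choice = choose(lambda a: worst_case(a, KNIGHTIAN))
mixed_choice = choose(
    lambda a: mixed_value(a, [(0.05, KNIGHTIAN), (0.95, CONFIDENT)])
)
print(full_knightian_choice, mixed_choice)
```

With these made-up numbers, pure maximin over the Knightian set favors the goodbyes, while the 5%/95% mixture favors carrying on as usual, because the Knightian hypothesis only contributes a small utility differential between the two actions.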
Universality and the “Filter”

I am very confused. How is this better than just telling the human overseers "no, really, be conservative about implementing things that might go wrong"? What makes a two-part architecture seem appealing? What does "epistemic dominance" look like in concrete terms here: relative to which observables do you want to dominate HCH, wouldn't this be very expensive, and how does this translate into buying you extra safety?

Introduction To The Infra-Bayesianism Sequence

What if you assumed the stuff you had the hypothesis about was independent of the stuff you have Knightian uncertainty about (until proven otherwise)?

E.g. if you're making hypotheses about a multi-armed bandit and the world also contains a meteor that might smash through your ceiling and kill you at any time, you might want to just say "okay, ignore the meteor, pretend my utility has a term for gambling wins that doesn't depend on the meteor at all."

The reason I want to consider stuff more like this is because I don't like having to evaluate my utility fun... (read more)

Something analogous to what you are suggesting occurs. Specifically, let's say you assign 95% probability to the bandit game behaving as normal, and 5% to "oh no, anything could happen, including the meteor". As it turns out, this behaves similarly to the ordinary bandit game being guaranteed, as the "maybe meteor" hypothesis assigns all your possible actions a score of "you're dead" so it drops out of consideration. The important aspect which a hypothesis needs, in order for you to ignore it, is that no matter what you do you get the same outcome, whether it be good or bad. A "meteor of bliss hits the earth and everything is awesome forever" hypothesis would also drop out of consideration because it doesn't really matter what you do in that scenario.

To be a wee bit more mathy, probabilistic mix of inframeasures works like this. We've got a probability distribution $\zeta \in \Delta\mathbb{N}$, and a bunch of hypotheses $\psi_i \in \square X$, things that take functions as input, and return expectation values. So, your prior, your probabilistic mixture of hypotheses according to your probability distribution, would be the function

$$f \mapsto \sum_{i \in \mathbb{N}} \zeta(i) \cdot \psi_i(f)$$

It gets very slightly more complicated when you're dealing with environments, instead of static probability distributions, but it's basically the same thing. And so, if you vary your actions/vary your choice of function $f$, and one of the hypotheses $\psi_i$ is assigning all these functions/choices of actions the same expectation value, then it can be ignored completely when you're trying to figure out the best function/choice of actions to plug in. So, hypotheses that are like "you're doomed no matter what you do" drop out of consideration, an infra-Bayes agent will always focus on the remaining hypotheses that say that what it does matters.
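The "drops out of consideration" point can be checked numerically with a small Python sketch. It is only a toy under stated assumptions: each hypothesis is modeled as a plain function from an action to an expectation value (standing in for a hypothesis applied to the function that action induces), and the arm names, payoffs, and the 95%/5% weights are hypothetical.

```python
def mixture_value(weights, hypotheses, action):
    """Value of `action` under a probabilistic mixture: sum of weight * hypothesis value."""
    return sum(w * h(action) for w, h in zip(weights, hypotheses))


def best_action(weights, hypotheses, actions):
    """Action maximizing the mixture value."""
    return max(actions, key=lambda a: mixture_value(weights, hypotheses, a))


# Hypothetical expected payoffs of a three-armed bandit.
BANDIT_PAYOFFS = {"arm_a": 0.2, "arm_b": 0.7, "arm_c": 0.5}


def normal_bandit(action):
    """The bandit game behaving as normal."""
    return BANDIT_PAYOFFS[action]


def meteor_doom(action):
    """'You're dead' no matter what you do: the same value for every action."""
    return 0.0


def meteor_of_bliss(action):
    """Everything is awesome no matter what you do: also constant."""
    return 1.0


actions = list(BANDIT_PAYOFFS)
choice_bandit_only = best_action([1.0], [normal_bandit], actions)
choice_with_meteor = best_action([0.95, 0.05], [normal_bandit, meteor_doom], actions)
choice_with_bliss = best_action([0.95, 0.05], [normal_bandit, meteor_of_bliss], actions)
print(choice_bandit_only, choice_with_meteor, choice_with_bliss)
```

A hypothesis that returns the same value for every action only adds a constant to the mixture, so the ranking of actions is the same in all three cases, whether the constant is "doom" or "bliss".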
Introduction To The Infra-Bayesianism Sequence

Of the agent foundations work from 2020, I think this sequence is my favorite, and I say this without actually understanding it.

The core idea is that Bayesianism is too hard. And so what we ultimately want is to replace probability distributions over all possible things with simple rules that don't have to put a probability on all possible things. In some ways this is the complement to logical uncertainty - logical uncertainty is about not having to have all possible probability distributions possible, this is about not having to put probability distributi... (read more)

I interviewed Vanessa here in an attempt to make this more digestible: it hopefully acts as context for the sequence, rather than a replacement for reading it.
Introduction To The Infra-Bayesianism Sequence

I'm confused about the Nirvana trick then. (Maybe here's not the best place, but oh well...) Shouldn't it break the instant you do anything with your Knightian uncertainty other than taking the worst-case?

2Vanessa Kosoy6mo
Notice that some non-worst-case decision rules are reducible to the worst-case decision rule.
Well, taking worst-case uncertainty is what infradistributions do. Did you have anything in mind that can be done with Knightian uncertainty besides taking the worst-case (or best-case)? And if you were dealing with best-case uncertainty instead, then the corresponding analogue would be assuming that you go to hell if you're mispredicted (and then, since best-case things happen to you, the predictor must accurately predict you).
The Solomonoff Prior is Malign

This was a really interesting post, and is part of a genre of similar posts about acausal interaction with consequentialists in simulatable universes.

The short argument is that if we (or not us, but someone like us with way more available compute) try to use the Kolmogorov complexity of some data to make a decision, our decision might get "hijacked" by simple programs that run for a very very long time and simulate aliens who look for universes where someone is trying to use the Solomonoff prior to make a decision and then based on what decision they want,... (read more)

Vanessa Kosoy's Shortform

if we assume our initial implementation to be flawed in a way that destroys alignment, why wouldn’t it also be flawed in a way that destroys corrigibility?

I think the people most interested in corrigibility are imagining a situation where we know what we're doing with corrigibility (e.g. we have some grab-bag of simple properties we want satisfied), but don't even know what we want from alignment, and then they imagine building an unaligned slightly-sub-human AGI and poking at it while we "figure out alignment."

Maybe this is a strawman, because the thin... (read more)

3Vanessa Kosoy6mo
The concept of corrigibility was introduced by MIRI, and I don't think that's their motivation? On my model of MIRI's model, we won't have time to poke at a slightly subhuman AI, we need to have at least a fairly good notion of what to do with a superhuman AI upfront. Maybe what you meant is "we won't know how to construct perfect-utopia-AI, so we will just construct a prevent-unaligned-AIs-AI and run it so that we can figure out perfect-utopia-AI at our leisure". Which, sure, but I don't see what it has to do with corrigibility. Corrigibility is neither necessary nor sufficient for safety. It's not strictly necessary because in theory an AI can resist modifications in some scenarios while always doing the right thing (although in practice resisting modifications is an enormous red flag), and it's not sufficient since an AI can be "corrigible" but cause catastrophic harm before someone notices and fixes it. What we're supposed to gain from corrigibility is having some margin of error around alignment, in which case we can decompose alignment as corrigibility + approximate alignment. But it is underspecified if we don't say along which dimensions or how big the margin is. If it's infinite margin along all dimensions then corrigibility and alignment are just isomorphic and there's no reason to talk about the former.
Summary of the Acausal Attack Issue for AIXI

I still feel like there's just too many pigeons and not enough holes.

Like, if you're an agent in some universe with complexity K(U) and you're located by a bridging rule with complexity K(B), you are not an agent with complexity K(U). Average case you have complexity (or really you think the world has some complexity) K(U)+K(B) minus some small constant. We can illustrate this fact by making U simple and B complicated - like locating a particular string within the digits of pi.

And if an adversary in a simple universe (complexity K(U')) "hijacks" you by ins... (read more)

Redwood's Technique-Focused Epistemic Strategy

In terms of how this strategy breaks, I think there's a lot of human guidance required to avoid either trying variations on the same not-quite-right ideas over and over, or trying a hundred different definitely-not-right ideas.

Given comfort and inertia, I expect the average research group to need impetus towards mixing things up. And they're smart people, so I'm looking forward to seeing what they do next.

Understanding Gradient Hacking

I think this is a totally fine length. But then, I would :P

I still feel like this was a little gears-light. Do the proposed examples of gradient hacking really work if you make a toy neural network with them? (Or does gradient descent find a way around the apparent local minimum?)
