All of Vanessa Kosoy's Comments + Replies

chinchilla's wild implications

it would be the best possible model of this type, at the task of language modeling on data sampled from the same distribution as MassiveText

Transformers are Turing complete, so "model of this type" is not much of a constraint. On the other hand, I guess it's theoretically possible that some weight matrices are inaccessible to current training algorithms no matter how much compute and data we have. It also seems possible that the scaling law doesn't go on forever, but phase-transitions somewhere (maybe very far) to a new trend which goes below the "irreducible" term.

Vanessa Kosoy's Shortform

Master post for ideas about infra-Bayesian physicalism.

Other relevant posts:

The Pragmascope Idea

Telephone Theorem, Redundancy/Resampling, and Maxent for the math, Chaos for the concepts.

Thank you!

Just because something can be learned efficiently doesn't mean it's convergent for a wide variety of cognitive systems.

I believe that the relevant cognitive systems all look like learning algorithms for a prior of a certain fairly specific type. I don't know what this prior looks like, but it's something very rich on the one hand and efficiently learnable on the other hand. So, if you showed that your formalism naturally produces priors that seem closer ... (read more)

The Pragmascope Idea

As I see it, the core theory of natural abstractions is now 80% nailed down

Question 1: What's the minimal set of articles one should read to understand this 80%?

Question/Remark 2: AFAICT, your theory has a major missing piece, which is, proving that "abstraction" (formalized according to your way of formalizing it) is actually a crucial ingredient of learning/cognition. The way I see it, such a proof should be by demonstrating that hypothesis classes defined in terms of probabilistic graph models / abstraction hierarchies can be learned with good sa... (read more)

4johnswentworth2d
Telephone Theorem [https://www.lesswrong.com/posts/jJf4FrfiQdDGg7uco/the-telephone-theorem-information-at-a-distance-is-mediated] , Redundancy/Resampling [https://www.lesswrong.com/posts/vvEebH5jEvxnJEvBC/abstractions-as-redundant-information] , and Maxent [https://www.lesswrong.com/posts/cqdDGuTs2NamtEhBW/maxent-and-abstractions-current-best-arguments] for the math, Chaos [https://www.lesswrong.com/posts/zcCtQWQZwTzGmmteE/chaos-induces-abstractions] for the concepts. If we want to show that abstraction is a crucial ingredient of learning/cognition, then "Can we efficiently learn hypothesis classes defined in terms of abstraction hierarchies, as captured by John's formalism?" is entirely the wrong question. Just because something can be learned efficiently doesn't mean it's convergent for a wide variety of cognitive systems. And even if such hypothesis classes couldn't be learned efficiently in full generality, it would still be possible for a subset of that hypothesis class to be convergent for a wide variety of cognitive systems, in which case general properties of the hypothesis class would still apply to those systems' cognition. The question we actually want here is "Is abstraction, as captured by John's formalism, instrumentally convergent for a wide variety of cognitive systems?". And that question is indeed not yet definitively answered. The pragmascope itself would largely allow us to answer that question empirically, and I expect the ability to answer it empirically will quickly lead to proofs as well.
Principles of Privacy for Alignment Research

Our work doesn't necessarily need wide memetic spread to be found by the people who know what to look for. E.g. people playing through the alignment game tree are a lot more likely to realize that ontology identification, grain-of-truth, value drift, etc, are key questions to ask, whereas ML researchers just pushing toward AGI are a lot less likely to ask those questions.

That's a valid argument, but I can also imagine groups that (i) in a world where alignment research is obscure proceed to create unaligned AGI (ii) in a world where alignment research i... (read more)

Principles of Privacy for Alignment Research

[For the record, here's previous relevant discussion]

My problem with the "nobody cares" model is that it seems self-defeating. First, if nobody cares about my work, then how would my work help with alignment? I don't put a lot of stock into building aligned AGI in the basement on my own. (And not only because I don't have a basement.) Therefore, any impact I will have flows through my work becoming sufficiently known that somebody who builds AGI ends up using it. Even if I optimistically assume that I will personally be part of that project, my work ne... (read more)

3johnswentworth10d
Our work doesn't necessarily need wide memetic spread to be found by the people who know what to look for. E.g. people playing through the alignment game tree are a lot more likely to realize that ontology identification, grain-of-truth, value drift, etc, are key questions to ask, whereas ML researchers just pushing toward AGI are a lot less likely to ask those questions. I do agree that a growing alignment community will add memetic fitness to alignment work in general, which is at least somewhat problematic for the "nobody cares" model. And I do expect there to be at least some steps which need a fairly large alignment community doing "normal" (i.e. paradigmatic) incremental research. For instance, on some paths we need lots of people doing incremental interpretability/ontology research to link up lots of concepts to their representations in a trained system. On the other hand, not all of the foundations need to be very widespread - e.g. in the case of incremental interpretability/ontology research, it's mostly the interpretability tools which need memetic reach, not e.g. theory around grain-of-truth or value drift.
Human values & biases are inaccessible to the genome

I think that "directly specified" is just an ill-defined concept. You can ask whether A specifies B using encoding C. But if you don't fix C? Then any A can be said to "specify" any B (you can always put the information into C). Algorithmic information theory might come to the rescue by rephrasing the question as: "what is the relative Kolmogorov complexity K(B|A)?" Here, however, we have more ground to stand on, namely there is some function f : G × E → B, where G is the space of genomes, E is the space of environments and B is the space of brains. Also we might... (read more)
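To make the K(B|A) framing slightly more tangible, here is a minimal sketch (my own toy illustration; K is uncomputable, so a general-purpose compressor only gives crude upper bounds, and the byte-strings below are made-up stand-ins):

```python
import zlib

def description_length(data: bytes) -> int:
    """Crude upper bound on K(data): size of a general-purpose compression."""
    return len(zlib.compress(data, 9))

def relative_complexity(b: bytes, a: bytes) -> int:
    """Rough stand-in for K(B|A): extra bytes needed to describe B once A is known."""
    return max(description_length(a + b) - description_length(a), 0)

genome = b"ACGT" * 1000                        # toy stand-in for a genome A
brain_similar = genome[:2000] + b"TTTT" * 500  # shares structure with A
brain_unrelated = bytes(range(256)) * 16       # shares no structure with A

print(relative_complexity(brain_similar, genome))    # small: A makes B cheap to describe
print(relative_complexity(brain_unrelated, genome))  # larger: A doesn't help much
```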

Human values & biases are inaccessible to the genome

Well, how do you define "directly specified"? If human brains reliably converge towards a certain algorithm, then effectively this algorithm is specified by the genome. The real question is, which parts depend only on genes and which parts depend on the environment. My tentative opinion is that the majority is in the genes, since humans are, broadly speaking, pretty similar to each other. One environmental effect is that feral humans grow up with serious mental problems. But, my guess is, this is not because of missing "values" or "biases", but (to 1st approxim... (read more)

2Alex Turner21d
I don't classify "convergently learned" as an instance of "directly specified", but rather "indirectly specified, in conjunction with the requisite environmental data." Here's an example. I think that humans' reliably-learned edge detectors in V1 are not "directly specified", in the same way that vision models don't have directly specified curve detectors, but these detectors are convergently learned in order to do well on vision tasks. If I say "sunk cost is directly specified", I mean something like "the genome specifies neural circuitry which will eventually, in situations where sunk cost arises, fire so as to influence decision-making." However, if, for example, the genome lays out the macrostructure of the connectome and the broad-scale learning process and some reward circuitry and regional learning hyperparameters and some other details, and then this brain eventually comes to implement a sunk-cost bias, I don't call that "direct specification." I wish I had been more explicit about "direct specification", and perhaps this comment is still not clear. Please let me know if so!
Human values & biases are inaccessible to the genome

My reasoning can be roughly described as:

  • There is a simple mathematical theory of agency, similar to how there are simple mathematical theories of e.g. probability or computational complexity
  • This theory should include an explanation of how agents can have goals defined not in terms of sensory data
  • I have a current best guess as to what the outline of this theory looks like, based on (i) simplicity (ii) satisfying natural-seeming desiderata and (iii) ability to prove relevant non-trivial theorems (for example, infra-Bayesian reinforcement learning theory is
... (read more)
4Quintin Pope24d
I'd note that it's possible for an organism to learn to behave (and think) in accordance with the "simple mathematical theory of agency" you're talking about, without said theory being directly specified by the genome. If the theory of agency really is computationally simple, then many learning processes probably converge towards implementing something like that theory, simply as a result of being optimized to act coherently in an environment over time.
Human values & biases are inaccessible to the genome

I think the way it works is approximately as follows. There is a fixed "ontological" infra-POMDP which is a coarse hard-coded world-model sufficient to define the concepts on which the reward depends (for humans, it would include concepts such as "other humans"). Then there is a prior which is composed of refinements of this infra-POMDP. The reward depends on the state of the ontological IPOMDP, so it is allowed to depend on the concepts of the hard-coded world-model (but not on the concepts which only exist in the refined models). Ofc, this leaves open the qu... (read more)

3Alex Turner1mo
Without knowing the details of infra-POMDPs or your other work, by what Bayesian evidence do you raise this particular hypothesis to consideration [https://www.readthesequences.com/Privileging-The-Hypothesis]? (I say this not to imply that you do not have such evidence, only that I do not presently see why I should consider this particular hypothesis.)
Vanessa Kosoy's Shortform

Here's a video of a talk I gave about PreDCA.

Formal Philosophy and Alignment Possible Projects

Something like Bayesian/expected utility maximization seems useful for understanding agents and agency. However, there is the problem that expected utility theory doesn’t seem to predict anything in particular. We want a better response to “Expected utility theory doesn’t predict anything” that can describe the insight of EU theory re what agents are without being misinterpreted / without failing to constrain expectations at all technically.

Agents are policies with a high value of g. So, "EU theory" does "predict" something, although it's a "soft" prediction (i.e. agency is a matter of degree).

A central AI alignment problem: capabilities generalization, and the sharp left turn

When I say "policy", I mean the entire behavior including the learning algorithm, not some asymptotic behavior the system is converging to. Obviously, the policy is represented as genetic code, not as individual decisions. When I say "evolution is directly selecting the policy", I mean that genotypes are selected based on their "expected reward" (reproductive fitness) rather than e.g. by evaluating the accuracy of the world-models those minds produce[1]. And, genotypes are not a priori constrained to be learning algorithms with particular architectures, th... (read more)

Where I agree and disagree with Eliezer

The word "bounded" in "bounded simplicity prior" referred to bounded computational resources. A "bounded simplicity prior" is a prior which involves either a "hard" (i.e. some hypotheses are excluded) or a "soft" (i.e. some hypotheses are down-weighted) bound on computational resources (or both), and also inductive bias towards simplicity (specifically it should probably behave as ~ 2^{-description complexity}). For a concrete example, see the prior I described here (w/o any claim to originality).
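As a minimal toy sketch of what such a prior could look like operationally (my own illustration, not the construction from the linked comment; the hypotheses and numbers are made up): exclude hypotheses over the compute bound (the "hard" variant) and weight the rest by ~2^{-description complexity}.

```python
# Hypothetical toy hypotheses: (name, description length in bits, compute steps needed).
hypotheses = [
    ("always 0",           3, 10),
    ("repeat last symbol", 5, 20),
    ("parity of history",  9, 500),
]

MAX_STEPS = 100  # "hard" compute bound; a "soft" variant would down-weight instead of excluding

def bounded_simplicity_prior(hyps, max_steps):
    """Drop hypotheses exceeding the compute bound, weight the rest by ~2^-(description length)."""
    weights = {name: 2.0 ** -bits for name, bits, steps in hyps if steps <= max_steps}
    z = sum(weights.values())
    return {name: w / z for name, w in weights.items()}

print(bounded_simplicity_prior(hypotheses, MAX_STEPS))
# -> {'always 0': 0.8, 'repeat last symbol': 0.2}; 'parity of history' is excluded by the bound
```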

2Richard Ngo2mo
Ah, I see. That makes sense now!
Where I agree and disagree with Eliezer

Not quite sure what you're saying here. Is the claim that speed penalties would help shift the balance against mesa-optimizers? These kinds of solutions are worth looking into, but I'm not too optimistic about them atm. First, the mesa-optimizer probably won't add a lot of overhead compared to the considerable complexity of emulating a brain. In particular, it need not work by anything like our own ML algorithms. So, if it's possible to rule out mesa-optimizers like this, it would require a rather extreme penalty. Second, there are limits on how much you can... (read more)

2Richard Ngo2mo
No, I wasn't advocating adding a speed penalty, I was just pointing at a reason to think that a speed prior would give a more accurate answer to the question of "which is favored" than the bounded simplicity prior you're assuming: But now I realise that I don't understand why you think this is true of transformers. Could you explain? It seems to me that there are many very simple hypotheses which take a long time to calculate, and which transformers therefore can't be representing.
Where I agree and disagree with Eliezer

Epistemic status: some of these ideas only crystallized today, normally I would take at least a few days to process before posting to make sure there are no glaring holes in the reasoning, but I saw this thread and decided to reply since it's topical.

Suppose that your imitator works by something akin to Bayesian inference with some sort of bounded simplicity prior (I think it's true of transformers). In order for Bayesian inference to converge to exact imitation, you usually need realizability. Obviously today we don't have realizability because the ANNs c... (read more)

2Richard Ngo2mo
In a deep learning context, the latter hypothesis seems much more heavily favored when using a simplicity prior (since gradient descent is simple to specify) than a speed prior (since gradient descent takes a lot of computation). So as long as the compute costs of inference remain smaller than the compute costs of training, a speed prior seems more appropriate for evaluating how easily hypotheses can become more epistemically sophisticated than the outer loop.
A central AI alignment problem: capabilities generalization, and the sharp left turn

I want to outline how my research programme attempts to address this core difficulty.

First, like I noted before, evolution is not a perfect analogy for AI. This is because evolution is directly selecting the policy, whereas a (model-based) AI system is separately selecting (i) a world-model (ii) a reward function and (iii) a plan (policy) based on i+ii. This inherently produces better generalization-of-alignment (but not nearly enough to solve the problem).

With iii, we have the least generalization problems, because we are not limited by training data: the... (read more)

3Alex Turner1mo
Huh? Evolution did not directly select over human policy decisions. Evolution specified brains, which do within-lifetime learning and therefore learn different policies given different upbringings, and e.g. learning rate mutations indirectly lead to statistical differences in human learned policies. Evolution probably specifies some reward circuitry, the learning architecture, the broad-strokes learning processes (self-supervised predictive + RL), and some other factors, from which the policy is produced. The IGF->human values analogy is indeed relevantly misleading IMO, but not for this reason.
AGI Ruin: A List of Lethalities

My 0th approximation answer is: you're describing something logically incoherent, like a p-zombie.

My 1st approximation answer is more nuanced. Words that, in the pre-Turing era, referred exclusively to humans (and sometimes animals, and fictional beings), such as "wants", "experiences" et cetera, might have two different referents. One referent is a natural concept, something tied into deep truths about how the universe (or multiverse) works. In particular, deep truths about the "relatively simple core structure that explains why complicated cognitive mach... (read more)

Vanessa Kosoy's Shortform

Infra-Bayesian physicalism is an interesting example in favor of the thesis that the more qualitatively capable an agent is, the less corrigible it is. (a.k.a. "corrigibility is anti-natural to consequentialist reasoning"). Specifically, alignment protocols that don't rely on value learning become vastly less safe when combined with IBP:

  • Example 1: Using steep time discount to disincentivize dangerous long-term plans. For IBP, "steep time discount" just means, predominantly caring about your source code running with particular short inputs. Such a goal s

... (read more)
AGI Ruin: A List of Lethalities

Second, the only reason why the question "what X wants" can make sense at all, is because X is an agent. As a corollary, it only makes sense to the extent that X is an agent.

I'm not sure this is true; or if it's true, I'm not sure it's relevant.

If we go down that path then it becomes the sort of conversation where I have no idea what common assumptions we have, if any, that we could use to agree. As a general rule, I find it unconstructive, for the purpose of trying to agree on anything, to say things like "this (intuitively compelling) assumpti... (read more)

1Rob Bensinger2mo
Fair enough! I don't think I agree in general, but I think 'OK, but what's your alternative to agency?' is an especially good case for this heuristic. The first counter-example that popped into my head was "a mind that lacks any machinery for considering, evaluating, or selecting actions; but it does have machinery for experiencing more-pleasurable vs. less pleasurable states". This is a mind we should be able to build, even if it would never evolve naturally. Possibly this still qualifies as an "agent" that "wants" and "pursues" things, as you conceive it, even though it doesn't select actions?
AGI Ruin: A List of Lethalities

Humans are at least a little coherent, or we would never get anything done; but we aren't very coherent, so the project of piecing together 'what does the human brain as a whole "want"' can be vastly more difficult than the problem of figuring out what a coherent optimizer wants.

This is a point where I feel like I do have a substantial disagreement with the "conventional wisdom" of LessWrong.

First, LessWrong began with a discussion of cognitive biases in human irrationality, so this naturally became a staple of the local narrative. On the other hand, I ... (read more)

Second, the only reason why the question "what X wants" can make sense at all, is because X is an agent. As a corollary, it only makes sense to the extent that X is an agent.

I'm not sure this is true; or if it's true, I'm not sure it's relevant. But assuming it is true...

Therefore, if X is not entirely coherent then X's preferences are only approximately defined, and hence we only need to infer them approximately.

... this strikes me as not capturing the aspect of human values that looks strange and complicated. Two ways I could imagine the strangeness and ... (read more)

AGI Ruin: A List of Lethalities

First, some remarks about the meta-level:

The ability to do new basic work noticing and fixing those flaws is the same ability as the ability to write this document before I published it, which nobody apparently did, despite my having had other things to do than write this up for the last five years or so. Some of that silence may, possibly, optimistically, be due to nobody else in this field having the ability to write things comprehensibly - such that somebody out there had the knowledge to write all of this themselves, if they could only have written

... (read more)

There is a big chunk of what you're trying to teach which is not weird and complicated, namely: "find this other agent, and what their values are". Because, "agents" and "values" are natural concepts, for reasons strongly related to "there's a relatively simple core structure that explains why complicated cognitive machines work".

This seems like it must be true to some degree, but "there is a big chunk" feels a bit too strong to me.

Possibly we don't disagree, and just have different notions of what a "big chunk" is. But some things that make the chunk feel sm... (read more)

Six Dimensions of Operational Adequacy in AGI Projects

Maybe corrigibility / task AGI / etc. is harder than CEV, but it just doesn't seem realistic to me to try to achieve full, up-and-running CEV with the very first AGI systems you build, within a few months or a few years of humanity figuring out how to build AGI at all.

The way I imagine the win scenario is, we're going to make a lot of progress in understanding alignment before we know how to build AGI. And, we're going to do it by prioritizing understanding alignment modulo capability (the two are not really possible to cleanly separate, but it might be... (read more)

1Rob Bensinger2mo
The latter, as I was imagining "95%".
Six Dimensions of Operational Adequacy in AGI Projects

I agree that it's a tricky problem, but I think it's probably tractable. The way PreDCA tries to deal with these difficulties is:

  • The AI can tell that, even before the AI was turned on, the physical universe was running certain programs.
  • Some of those programs are "agentic" programs.
  • Agentic programs have approximately well-defined utility functions.
  • Disassembling the humans doesn't change anything, since it doesn't affect the programs that were already running[1] before the AI was turned on.
  • Since we're looking at agent-programs rather than specific agen
... (read more)
Six Dimensions of Operational Adequacy in AGI Projects

Before humanity gets to steps 1-2 ('use CEV or something to make the long-term future awesome'), it needs to get past steps 3-6 ('use limited task AGI to ensure that humanity doesn't kill itself with AGI so we can proceed to take our time with far harder problems like "what even is CEV" and "how even in principle would one get an AI system to robustly do anything remotely like that, without some subtle or not-so-subtle disaster resulting"').

I want to register my skepticism about this claim. Whereas it might naively seem that "put a strawberry on a plate... (read more)


And if humans had a utility function and we knew what that utility function was, we would not need CEV.  Unfortunately extracting human preferences over out-of-distribution options and outcomes at dangerously high intelligence, using data gathered at safe levels of intelligence and a correspondingly narrower range of outcomes and options, when there exists no sensory ground truth about what humans want because human raters can be fooled or disassembled, seems pretty complicated.  There is ultimately a rescuable truth about what we want, and CEV i... (read more)

1Rob Bensinger2mo
Yeah, I'm very interested in hearing counter-arguments to claims like this. I'll say that although I think task AGI is easier, it's not necessarily strictly easier, for the reason you mentioned. Maybe a cruxier way of putting my claim is: Maybe corrigibility / task AGI / etc. is harder than CEV, but it just doesn't seem realistic to me to try to achieve full, up-and-running CEV with the very first AGI systems you build, within a few months or a few years of humanity figuring out how to build AGI at all. And I do think you need to get CEV up and running within a few months or a few years, if you want to both (1) avoid someone else destroying the world first, and (2) not use a "strawberry-aligned" AGI to prevent 1 from happening. All of the options are to some extent a gamble, but corrigibility, task AGI, limited impact, etc. strike me as gambles that could actually realistically work out well for humanity even under extreme time pressure to deploy a system within a year or two of 'we figure out how to build AGI'. I don't think CEV is possible under that constraint. (And rushing CEV and getting it only 95% correct poses far larger s-risks than rushing low-impact non-operator-modeling strawberry AGI and getting it only 95% correct.)
Vanessa Kosoy's Shortform

Here's a question inspired by thinking about Turing RL, and trying to understand what kind of "beliefs about computations" we should expect the agent to acquire.

Does mathematics have finite information content?

First, let's focus on computable mathematics. At first glance, the answer seems obviously "no": because of the halting problem, there's no algorithm (i.e. a Turing machine that always terminates) which can predict the result of every computation. Therefore, you can keep learning new facts about results of computations forever. BUT, maybe most of thos... (read more)

3Alex Mennen3mo
Wikipedia claims [https://en.wikipedia.org/wiki/Algorithmically_random_sequence#Properties_and_examples_of_Martin-L%C3%B6f_random_sequences] that every sequence is Turing reducible to a random one, giving a positive answer to the non-resource-bounded version of any question of this form. There might be a resource-bounded version of this result as well, but I'm not sure.
Vanessa Kosoy's Shortform

Two more remarks.

User Detection

It can be useful to identify and assist specifically the user rather than e.g. any human that ever lived (and maybe some hominids). For this purpose I propose the following method. It also strengthens the protocol by relieving some pressure from other classification criteria.

Given two agents G and H, we can ask which points on G's timeline are in the causal past of which points of H's timeline. To answer this, consider the counterfactual in which G takes a random action (or sequence of actions) at some point (or interval... (read more)
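As a very rough illustration of the counterfactual mutual-information test (a toy simulation, not part of the actual protocol; the binary action space, channel model and noise level are arbitrary): inject a random action for G and check how much information it carries about H's observations.

```python
import math, random
from collections import Counter

def mutual_information(pairs):
    """Empirical mutual information (in bits) between two discrete variables."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(c / n * math.log2(c * n / (px[x] * py[y])) for (x, y), c in joint.items())

def observation_of_H(action_of_G, causally_linked, noise=0.1):
    """Toy channel: H's observation reflects G's counterfactual action only if that
    action lies in H's causal past; otherwise it is independent noise."""
    if causally_linked and random.random() > noise:
        return action_of_G
    return random.randint(0, 1)

random.seed(0)
for linked in (True, False):
    samples = [(a, observation_of_H(a, linked))
               for a in (random.randint(0, 1) for _ in range(10_000))]
    print(f"linked={linked}: I(action; observation) ≈ {mutual_information(samples):.3f} bits")
```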

Infra-Topology

All credit for this beautiful work goes to Alex.

Vanessa Kosoy's Shortform

Some additional thoughts.

Non-Cartesian Daemons

These are notoriously difficult to deal with. The only methods I know of that are applicable to other protocols are homomorphic cryptography and quantilization of envelope (external computer) actions. But, in this protocol, they are dealt with the same as Cartesian daemons! At least if we assume a non-Cartesian attack requires an envelope action, the malign hypotheses which are would-be sources of such actions are discarded without giving an opportunity for attack.

Weaknesses

My main concerns with this approach ... (read more)

Vanessa Kosoy's Shortform

Precursor Detection, Classification and Assistance (PreDCA)

Infra-Bayesian physicalism provides us with two key building blocks:

  • Given a hypothesis about the universe, we can tell which programs are running. (This is just the bridge transform.)
  • Given a program, we can tell whether it is an agent, and if so, which utility function it has[1] (the "evaluating agent" section of the article).

I will now outline how we can use these building blocks to solve both the inner and outer alignment problem. The rough idea is:

  • For each hypothesis in the prior, check
... (read more)
2Vanessa Kosoy1mo
Here's a video [https://www.youtube.com/watch?v=24vIJDBSNRI] of a talk I gave about PreDCA.
2Vanessa Kosoy3mo
Two more remarks. USER DETECTION It can be useful to identify and assist specifically the user rather than e.g. any human that ever lived (and maybe some hominids). For this purpose I propose the following method. It also strengthens the protocol by relieving some pressure from other classification criteria. Given two agents G and H, we can ask which points on G's timeline are in the causal past of which points of H's timeline. To answer this, consider the counterfactual in which G takes a random action (or sequence of actions) at some point (or interval) on G's timeline, and measure the mutual information between this action(s) and H's observations at some interval on H's timeline. Using this, we can effectively construct a future "causal cone" emanating from the AI's origin, and also a past causal cone emanating from some time t on the AI's timeline. Then, "nearby" agents will meet the intersection of these cones for low values of t whereas "faraway" agents will only meet it for high values of t or not at all. To first approximation, the user would be the "nearest" precursor[1] [#fn-ZN7Zhqkk6GqFZdaJd-1] agent i.e. the one meeting the intersection for the minimal t. More precisely, we expect the user's observations to have nearly maximal mutual information with the AI's actions: the user can e.g. see every symbol the AI outputs to the display. However, the other direction is less clear: can the AI's sensors measure every nerve signal emanating from the user's brain? To address this, we can fix t to a value s.t. we expect only the user to meet the intersection of cones, and have the AI select the agent which meets this intersection for the highest mutual information threshold. This probably does not make the detection of malign agents redundant, since AFAICT a malign simulation hypothesis might be somehow cleverly arranged to make a malign agent the user. MORE ON COUNTERFACTUALS In the parent post I suggested "instead of examining only Θ we also examine co
2Vanessa Kosoy4mo
Some additional thoughts. NON-CARTESIAN DAEMONS [HTTPS://WWW.LESSWRONG.COM/POSTS/5BD75CC58225BF0670375575/THE-LEARNING-THEORETIC-AI-ALIGNMENT-RESEARCH-AGENDA#TAMING_DAEMONS] These are notoriously difficult to deal with. The only methods I know of that are applicable to other protocols are homomorphic cryptography and quantilization of envelope (external computer) actions. But, in this protocol, they are dealt with the same as Cartesian daemons! At least if we assume a non-Cartesian attack requires an envelope action, the malign hypotheses which are would-be sources of such actions are discarded without giving an opportunity for attack. WEAKNESSES My main concerns with this approach are: * The possibility of major conceptual holes in the definition of precursors. More informal analysis can help, but ultimately mathematical research in infra-Bayesian physicalism in general and infra-Bayesian cartesian/physicalist multi-agent [https://www.lesswrong.com/posts/dPmmuaz9szk26BkmD/shortform?commentId=uZ5xq73xmZSTSZN33] interactions in particular is required to gain sufficient confidence. * The feasibility of a good enough classifier. At present, I don't have a concrete plan for attacking this, as it requires inputs from outside of computer science. * Inherent "incorrigibility": once the AI becomes sufficiently confident that it correctly detected and classified its precursors, its plans won't defer to the users any more than the resulting utility function demands. On the second hand, I think the concept of corrigibility is underspecified [https://www.lesswrong.com/posts/dPmmuaz9szk26BkmD/shortform?commentId=5Rxgkzqr8XsBwcEQB] so much that I'm not sure it is solved (rather than dissolved) even in the Book [https://www.lesswrong.com/posts/34Gkqus9vusXRevR8/late-2021-miri-conversations-ama-discussion?commentId=PYHHJkHcS55ekmWEE] . Moreover, the concern can be ameliorated by sufficiently powerful interpretabi
Vanessa Kosoy's Shortform

Infradistributions admit an information-theoretic quantity that doesn't exist in classical theory. Namely, it's a quantity that measures how many bits of Knightian uncertainty an infradistribution has. We define it as follows:

Let X be a finite set and Θ a crisp infradistribution (credal set) on X, i.e. a closed convex subset of ΔX. Then, imagine someone trying to communicate a message by choosing a distribution out of Θ. Formally, let Y be any other finite set (space of messages), μ ∈ ΔY (prior over messages) and K : Y → Θ (communication protocol). Consider the ... (read more)
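To illustrate just the communication setup (a toy numerical sketch with made-up numbers; the definition itself is truncated above and presumably involves optimizing over such protocols): fix a small credal set, a message prior and a protocol, and compute the mutual information between message and outcome. A singleton credal set forces every message onto the same distribution, so nothing can be communicated this way, which is the sense in which the quantity measures bits of Knightian uncertainty.

```python
import math

# Toy credal set on X = {0, 1}: all distributions with p(X=1) in [0.2, 0.8].
# Protocol: send message y by choosing one allowed distribution per message.
prior = {"a": 0.5, "b": 0.5}        # mu, prior over messages
protocol = {"a": 0.2, "b": 0.8}     # K(y) = p(X=1 | message y), both inside the credal set

def mutual_information(prior, protocol):
    """Exact I(message; outcome) in bits for the joint mu(y) * K(y)(x)."""
    px1 = sum(prior[y] * protocol[y] for y in prior)  # marginal p(X=1)
    marginal = {1: px1, 0: 1 - px1}
    mi = 0.0
    for y, py in prior.items():
        for x in (0, 1):
            pxy = py * (protocol[y] if x == 1 else 1 - protocol[y])
            if pxy > 0:
                mi += pxy * math.log2(pxy / (py * marginal[x]))
    return mi

print(f"I(message; outcome) = {mutual_information(prior, protocol):.3f} bits")
# A singleton credal set would force protocol['a'] == protocol['b'], giving 0 bits.
```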

AXRP Episode 14 - Infra-Bayesian Physicalism with Vanessa Kosoy

Multiple branches can only exist transiently during the weird experiment (i.e. neither before nor after). Naturally, if the agent knows in advance the experiment is going to happen, then it anticipates those branches to appear.

AXRP Episode 14 - Infra-Bayesian Physicalism with Vanessa Kosoy

The wavefunction has other branches, because it's the same mathematical object governed by the same equations. Only, the wavefunction doesn't exist physically, it's just an intermediate variable in the computation. The things that exist (corresponding to a particular variable in the formalism) and the things that are experienced (corresponding to some function of that variable) only have one branch.

1Adele Lopez4mo
So in a "weird experiment", the infrabayesian starts by believing only one branch exists, and then at some point starts believing in multiple branches?
AXRP Episode 14 - Infra-Bayesian Physicalism with Vanessa Kosoy

Btw, there is some amount of philosophical convergence between this and some recent work I did on critical agential physics;

Thanks, I'll look at that!

It seems like "infra-Bayesianism" may be broadly compatible with frequentism;

Yes! In frequentism, we define probability distributions as limits of frequencies. One problem with this is, what to do if there's no convergence? In the real world, there won't be convergence unless you have an infinite sequence of truly identical experiments, which you never have. At best, you have a long sequence of similar... (read more)
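A toy example of the non-convergence problem (my illustration): a sequence of alternating, exponentially growing blocks of 1s and 0s has running frequencies that never settle down, so there is no single limiting distribution to point to; a set of limit points (a credal set) is a more faithful summary.

```python
# Alternating blocks of 1s and 0s with exponentially growing lengths.
seq = []
for k in range(20):
    seq += [k % 2] * (2 ** k)

ones, running = 0, []
for i, b in enumerate(seq, start=1):
    ones += b
    running.append(ones / i)

# The running frequency keeps swinging between roughly 1/3 and 2/3 instead of converging.
print(min(running[1000:]), max(running[1000:]))
```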

AXRP Episode 14 - Infra-Bayesian Physicalism with Vanessa Kosoy

By "weird experiment" I mean things like, reversing decoherence. That is, something designed to cause interference between branches of the wavefunction with minds that remember different experiences[1]. Which obviously requires levels of technology we are nowhere near to reaching[2]. As long as decoherence happens as usual, there is only one copy.


  1. Ofc it requires erasing their contradicting memories among other things. ↩︎

  2. There is a possible "shortcut" though, namely simulating minds on quantum computers. Naturally, in this case only the quantum-upload

... (read more)
1Adele Lopez4mo
If there aren't other branches, then shouldn't that be impossible? Not just in practice but in principle.
ELK Computational Complexity: Three Levels of Difficulty

Sidenote: there seems to be some connection between this line of thought and my ideas about antitraining (how to design a learning algorithm for task D that doesn't develop latent knowledge about task E).

Job Offering: Help Communicate Infrabayesianism

One of the problems with imprecise bayesianism is that they haven't come up with a good update rule -- turns out it's much trickier than it looks. You can't just update all the distributions in the set, because [reasons i am forgetting]. Part of the reason infrabayes generalizes imprecise bayes is to fix this problem.

The reason you can't just update all the distributions in the set is, it wouldn't be dynamically consistent. That is, planning ahead what to do in every contingency versus updating and acting accordingly would produce different policies.

The... (read more)
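For a concrete illustration of the dynamic-consistency failure (a standard Ellsberg-style toy example, written out by hand here; it is not the infra-Bayesian construction itself): with maximin over a credal set, committing in advance to a contingent bet and updating-every-distribution-then-choosing pick different bets.

```python
from fractions import Fraction

# Ellsberg urn: 1/3 red, the remaining 2/3 split between black and yellow in unknown proportion.
# Credal set sampled at p(black) in {0, 1/3, 2/3}; payoffs are linear, so the extremes suffice.
priors = [(Fraction(1, 3), p, Fraction(2, 3) - p)
          for p in (Fraction(0), Fraction(1, 3), Fraction(2, 3))]

bet_RY = {"R": 1, "B": 0, "Y": 1}   # pays on red or yellow
bet_BY = {"R": 0, "B": 1, "Y": 1}   # pays on black or yellow

def maximin(bet, dists):
    """Worst-case expected payoff over the (sampled) credal set."""
    return min(sum(p * bet[s] for p, s in zip(d, "RBY")) for d in dists)

def update_on_not_yellow(d):
    """Naive distribution-by-distribution conditioning on the event {red, black}."""
    pr, pb, _ = d
    z = pr + pb
    return (pr / z, pb / z, Fraction(0))

# Planning ahead: pick now the bet to hold in the contingency "the ball is not yellow"
# (the two bets agree on yellow, so only that contingency matters).
print("plan ahead :", maximin(bet_RY, priors), "vs", maximin(bet_BY, priors))
# Updating every distribution first and then choosing: the preference flips.
posteriors = [update_on_not_yellow(d) for d in priors]
print("update&act :", maximin(bet_RY, posteriors), "vs", maximin(bet_BY, posteriors))
```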

ELK Thought Dump

The main difficulty is that you still need to translate between the formal language of computations and something humans can understand in practice (which probably means natural language). This is similar to Dialogic RL. So you still need an additional subsystem for making this translation, e.g. AQD. At which point you can ask, why not just apply AQD directly to a pivotal[1] action?

I'm not sure what the answer is. Maybe we should apply AQD directly, or maybe AQD is too weak for pivotal actions but good enough for translation. Or maybe it's not even good en... (read more)

Late 2021 MIRI Conversations: AMA / Discussion

I'm going to try and write a table of contents for the textbook, just because it seems like a fun exercise.

Epistemic status: unbridled speculation

Volume I: Foundation

  • Preface [mentioning, ofc, the infamous incident of 2041]
  • Chapter 0: Introduction

Part I: Statistical Learning Theory

  • Chapter 1: Offline Learning [VC theory and Watanabe's singular learning theory are both special cases of what's in this chapter]
  • Chapter 2: Online Learning [infra-Bayesianism is introduced here, Garrabrant induction too]
  • Chapter 3: Reinforcement Learning
  • Chapter 4: Lifelong
... (read more)
ELK Thought Dump

A quick comment after skimming: IBP might be relevant here, because it formalizes computationalism and provides a natural "objective" domain of truth (namely, which computations are "running" and what values they take).

2Abram Demski5mo
Any more detailed thoughts on its relevance? EG, a semi-concrete ELK proposal based on this notion of truth/computationalism? Can identifying-running-computations stand in for direct translation?
Shah and Yudkowsky on alignment failures

If that's the plan, then I guess my next question is how we should go about limiting the strategy space and/or reducing the search quality? (Taking into account things like deception risk.)

I suggested doing this using quantilization.
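For readers who haven't seen the term, a generic quantilizer (a minimal sketch of the general idea, not of the specific proposal referenced above; the action space and utility are made up) samples from a base distribution and picks among the top q-fraction by estimated utility instead of taking the argmax:

```python
import random

def quantilize(base_sampler, utility, q=0.1, n=1000):
    """Sample n actions from a base distribution and choose uniformly among the
    top q-fraction by estimated utility, rather than argmaxing (which would
    fully exploit errors in the utility estimate)."""
    actions = [base_sampler() for _ in range(n)]
    actions.sort(key=utility, reverse=True)
    return random.choice(actions[:max(1, int(q * n))])

# Illustrative use with a made-up continuous action space and utility.
action = quantilize(lambda: random.uniform(-1, 1), utility=lambda a: -abs(a - 0.3))
print(action)
```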

Christiano and Yudkowsky on AI predictions and human intelligence

I'm not sure what the difference is between what you're saying here and what I said about QNIs. Is it that you expect to be able to see the emergent technology before the singular (crossover) point? Actually, the fact you describe DL as "currently useless" makes me think we should be talking about progress as a function of two variables: time and "maturity", where maturity inhabits, roughly speaking, a scale from "theoretical idea" to "proof of concept" to "beats SOTA in lab conditions" to "commercial product". In this sense, the "lab progress" curve is alr... (read more)

Christiano and Yudkowsky on AI predictions and human intelligence
  1. In fields with lots of spending and impact, most technological progress is made gradually rather than abruptly, for most ways of measuring...

Your arguments about technology X apply to any other technological goal---better fusion reactors or solar panels, more generally cheaper energy, rockets, semiconductors, whatever. So it seems like they should be visible in the base rate for 2. Do you think that a significant fraction of technological progress is abrupt and unpredictable in the sense that you are saying TAI will probably be?

I think that you can... (read more)

Certainly I don't see fusion reactors, solar panels or (use in electronics of) semiconductors as counterexamples, since each of these was invented at some point, and didn't gradually evolve from some completely different technology.

Your definition of "discontinuity" seems broadly compatible with my view of the future then. Definitely there are different technologies that are not all outgrowths of one another.

My main point of divergence is:

Now, when a QNI comes along, it doesn't necessarily look like a discontinuity, because there might be a lot of work to

... (read more)
Christiano and Yudkowsky on AI predictions and human intelligence

I don't quite see how this is a problem for the model. The narrower you draw the boundary, the more jumpy progress will be, right?

So, you're saying: if we draw the boundary around a narrow field, we get jumpy/noisy progress. If we draw the boundary around a broad field, all the narrow subfields average out and the result is less noise. This makes a lot of sense, thank you!

The question is, what metric do we use to average the subfields. For example, on some metrics the Manhattan project might be a rather small jump in military-technology-averaged-ove... (read more)

Christiano and Yudkowsky on AI predictions and human intelligence

I don't see what it has to do with risk-return. Sure, many startups fail. And, plausibly many people tried to build an airplane and failed before the Wright brothers. And, many people keep trying to build AGI and failing. This doesn't mean there won't be kinks in AI progress or even a TAI created by a small group.

Saying that "the subjective expected value of AI progress over time is a smooth curve" is a very different proposition from "the actual AI progress over time will be a smooth curve".

My line of argument here is not trying to prove a particular stor... (read more)

4Paul Christiano5mo
What is the confidence level of predictions you are pushing back against? I'm at like 30% on fast takeoff in the sense of "1 year doubling without preceding 4 year doubling" (a threshold roughly set to break any plausible quantitative historical precedent; a threshold intended to be faster than historical precedent but that's probably similar to the agricultural revolution sped up 10,000x). I'm at maybe 10-20% on the kind of crazier world Eliezer imagines. Is that a high level of confidence? I'm not sure I would be able to spread my probability in a way that felt unconfident (to me) without giving probabilities that low to lots of particular ways the future could be crazy. E.g. 10-20% is similar to the probability I put on other crazy-feeling possibilities like no singularity at all, rapid GDP acceleration with only moderate cognitive automation, or singleton that arrests economic growth before we get to 4 year doubling times...
Christiano and Yudkowsky on AI predictions and human intelligence

Yes, this is something I discuss in the edit (you probably started typing your reply before I posted it).

3Rohin Shah5mo
The "continuous view" argument is about takeoff speeds, not about AI risk? If AI risk arose from narrow systems that couldn't produce a billion dollars of value then I'd expect that risk could arise more discontinuously from a new paradigm. But AI risk arises from systems that are sufficiently intelligent that they could produce billions of dollars of value.
Christiano and Yudkowsky on AI predictions and human intelligence

Christiano's model of progress, AFAIU, can be summarized as: "When only a few people work in a field, big jumps in progress are possible. When many people work in a field, the low hanging fruits are picked quickly and then progress is smooth."

The problem with this model is, its predictions depend a lot on how you draw the boundary around "field". Take Yudkowsky's example of startups. How do we explain small startups succeeding where large companies failed? And it's not a lack of economic incentives, since successful startups sometimes make huge profits. (Often ... (read more)

5Paul Christiano5mo
My view is: 1. If X is obviously very valuable, many people will work on achieving X (potentially including lots of different approaches they see to achieving X). 2. In fields with lots of spending and impact, most technological progress is made gradually rather than abruptly, for most ways of measuring. 3. We haven't said very much at all about why AI should be one of the exceptions (compare to the situation when discussing nuclear weapons, where you can make fantastic arguments about why it would be different). Eliezer's argument about criticality seems to me to just not work unless one rejects 1+2 altogether (unlike nuclear weapons). Your arguments about technology X apply to any other technological goal---better fusion reactors or solar panels, more generally cheaper energy, rockets, semiconductors, whatever. So it seems like they should be visible in the base rate for 2. Do you think that a significant fraction of technological progress is abrupt and unpredictable in the sense that you are saying TAI will probably be? I don't know exactly what you are responding to here. I have some best guesses about what progress will look like, but they are pretty separate from the broader heuristic. And I'm not sure this is a fair representation of my actual view, within 20 years I think it's reasonably likely that AI will look fairly different, on a scale of 5 years that seems kind of unlikely. I'm predicting that the performance of AI systems will grow relatively continuously and predictably, not that AI isn't risky or even that risk will emerge gradually. I think it's pretty unclear how this bears on the general schema above. But I'm happy to consider particular examples of startups that made rapid/unpredictable progress towards particular technological goals (perhaps by pursuing a new approach), since those are the kind of thing I'm predicting are rare. They sure look rare to me (i.e. are responsible for a very small share of total te

I'm guessing that a proponent of Christiano's theory would say: sure, such-and-such startup succeeded but it was because they were the only ones working on problem P, so problem P was an uncrowded field at the time. Okay, but why do we draw the boundary around P rather than around "software" or around something in between which was crowded?

I'd make a different reply: you need to not just look at the winning startup, but all startups. If it's the case that the 'startup ecosystem' is earning 100% returns and the rest of the economy is earning 5% returns, the... (read more)

3ESRogs5mo
I don't quite see how this is a problem for the model. The narrower you draw the boundary, the more jumpy progress will be, right? Successful startups are big relative to individuals, but not that big relative to the world as a whole. If we're talking about a project / technology / company that can rival the rest of the world in its output, then the relevant scale is trillions of dollars (prob deca-trillions), not billions. And while the most fantastically successful startups can become billion dollar companies within a few years, nobody has yet made it to a trillion [https://companiesmarketcap.com/] in less than a decade. EDIT: To clarify, not trying to say that something couldn't grow faster than any previous startup. There could certainly be a 'kink' in the rate of progress, like you describe. I just want to emphasize that: 1. startups are not that jumpy, on the world scale 2. the actual scale of the world matters A simple model for the discontinuousness of a field might have two parameters — one for the intrinsic lumpiness of available discoveries, and one for total effort going into discovery. And, * all else equal, more people means smoother progress — if we lived in a trillion person world, AI progress would be more continuous * it's an open empirical question whether the actual values for these parameters will result in smooth or jumpy takeoff: * even if investment in AI is in the deca-trillions and a meaningful fraction of all world output, it could still be that the actual territory of available discoveries is so lumpy that progress is discontinuous * but, remember that reality has a surprising amount of detail [http://johnsalvatier.org/blog/2017/reality-has-a-surprising-amount-of-detail] , which I think tends to push things in a smoother direction — it means there are more fiddly details to work through, even when you have a unique insight or technological advantage * or, in other words, even i
2Rohin Shah5mo
On my version of the "continuous view", the Technology X story seems plausible, but it starts with a shitty version of Technology X that doesn't immediately produce billions of dollars of impact (or something similar, e.g. killing all humans), that then improves faster than the existing technology, such that an outside observer looking at both technologies could use trend extrapolation to predict that Technology X would be the one to reach TAI. (And you can make this prediction at least, say, 3 years in advance of TAI, i.e. Technology X isn't going to be accelerating so fast that you have zero time to react.)
The Reasonable Effectiveness of Mathematics or: AI vs sandwiches

In this post I speculated on the reasons for why mathematics is so useful so often, and I still stand behind it. The context, though, is the ongoing debate in the AI alignment community between the proponents of heuristic approaches and empirical research[1] ("prosaic alignment") and the proponents of building foundational theory and mathematical analysis (as exemplified in MIRI's "agent foundations" and my own "learning-theoretic" research agendas).

Previous volleys in this debate include Ngo's "realism about rationality" (on the anti-theory side), the pro... (read more)

Clarifying inner alignment terminology

This post aims to clarify the definitions of a number of concepts in AI alignment introduced by the author and collaborators. The concepts are interesting, and some researchers evidently find them useful. Personally, I find the definitions confusing, but I did benefit a little from thinking about this confusion. In my opinion, the post could greatly benefit from introducing mathematical notation[1] and making the concepts precise at least in some very simplistic toy model.

In the following, I'll try going over some of the definitions and explicating my unde... (read more)
