Vanessa Kosoy

AI alignment researcher supported by HUJI, MIRI and LTFF. Working on the learning-theoretic agenda.

E-mail: vanessa DOT kosoy AT {the thing reverse stupidity is not} DOT org

Wiki Contributions


Infra-Bayesian physicalism: a formal theory of naturalized induction

The short answer is, I don't know.

The long answer is, here are some possibilities, roughly ordered from "boring" to "weird":

  1. The framework is wrong.
  2. The framework is incomplete, there is some extension which gets rid of monotonicity. There are some obvious ways to make such extensions, but they look uglier and without further research it's hard to say whether they break important things or not.
  3. Humans are just not physicalist agents, you're not supposed to model them using this framework, even if this framework can be useful for AI. This is why humans took so much time coming up with science.
  4. Like #3, and also if we thought long enough we would become convinced of some kind of simulation/deity hypothesis (where the simulator/deity is a physicalist), and this is normatively correct for us.
  5. Because the universe is effectively finite (since it's asymptotically de Sitter), there are only so many computations that can run. Therefore, even if you only assign positive value to running certain computations, it effectively implies that running other computations is bad. Moreover, the fact the universe is finite is unsurprising since infinite universes tend to have all possible computations running which makes them roughly irrelevant hypotheses for a physicalist.
  6. We are just confused about hell being worse than death. For example, maybe people in hell have no qualia. This makes some sense if you endorse the (natural for physicalists) anthropic theory that only the best-off future copy of you matters. You can imagine there always being a "dead copy" of you, so that if something worst-than-death happens to the apparent-you, your subjective experiences go into the "dead copy".
Vanessa Kosoy's Shortform

The problem is that if implies that creates but you consider a counterfactual in which doesn't create then you get an inconsistent hypothesis i.e. a HUC which contains only 0. It is not clear what to do with that. In other words, the usual way of defining counterfactuals in IB (I tentatively named it "hard counterfactuals") only makes sense when the condition you're counterfactualizing on is something you have Knightian uncertainty about (which seems safe to assume if this condition is about your own future action but not safe to assume in general). In a child post I suggested solving this by defining "soft counterfactuals" where you consider coarsenings of in addition to itself.

chinchilla's wild implications

it would be the best possible model of this type, at the task of language modeling on data sampled from the same distribution as MassiveText

Transformers a Turing complete, so "model of this type" is not much of a constraint. On the other hand, I guess it's theoretically possible that some weight matrices are inaccessible to current training algorithms no matter how much compute and data we have. It seems also possible that the scaling law doesn't go on forever, but phase-transitions somewhere (maybe very far) to a new trend which goes below the "irreducible" term.

Vanessa Kosoy's Shortform

Master post for ideas about infra-Bayesian physicalism.

Other relevant posts:

The Pragmascope Idea

Telephone Theorem, Redundancy/Resampling, and Maxent for the math, Chaos for the concepts.

Thank you!

Just because something can be learned efficiently doesn't mean it's convergent for a wide variety of cognitive systems.

I believe that the relevant cognitive systems all look like learning algorithms for a prior of certain fairly specific type. I don't know how this prior looks like, but it's something very rich on the one hand and efficiently learnable on the other hand. So, if you showed that your formalism naturally produces priors that seem closer to that "holy grail prior", in terms of richness/efficiency, compared to priors that we already know (e.g. MDPs with small number of states which are not rich enough, or the Solomonoff prior which is both statistically and computationally intractable), that would at least be evidence that you're going in the right direction.

And even if such hypothesis classes couldn't be learned efficiently in full generality, it would still be possible for a subset of that hypothesis class to be convergent for a wide variety of cognitive systems, in which case general properties of the hypothesis class would still apply to those systems' cognition.

Hmm, I'm not sure what would it mean for a subset of a hypothesis class to be "convergent".

The question we actually want here is "Is abstraction, as captured by John's formalism, instrumentally convergent for a wide variety of cognitive systems?".

That's interesting, but I'm still not sure what it means exactly. Let's say we take a reinforcement learner which a specific hypothesis class, such all MDPs of certain size, or some family of MDPs with low eluder dimension, or the actual AIXI. How would you determine whether your formalism is "instrumentally convergent" for each of those? Is there a rigorous way to state the question?

The Pragmascope Idea

As I see it, the core theory of natural abstractions is now 80% nailed down

Question 1: What's the minimal set of articles one should read to understand this 80%?

Question/Remark 2: AFAICT, your theory has a major missing piece, which is, proving that "abstraction" (formalized according to your way of formalizing it) of is actually a crucial ingredient of learning/cognition. The way I see it, such a proof should be by demonstrating that hypothesis classes defined in terms of probabilistic graph models / abstraction hierarchies can be learned with good sample complexity (and better yet if you can tell something about the computational complexity), in a manner that cannot be achieved if you discard any of the important-according-to-you pieces. You might have some different approach to this, but I'm not sure what it is.

Principles of Privacy for Alignment Research

Our work doesn't necessarily need wide memetic spread to be found by the people who know what to look for. E.g. people playing through the alignment game tree are a lot more likely to realize that ontology identification, grain-of-truth, value drift, etc, are key questions to ask, whereas ML researchers just pushing toward AGI are a lot less likely to ask those questions.

That's a valid argument, but I can also imagine groups that (i) in a world where alignment research is obscure proceed to create unaligned AGI (ii) in a world where alignment research is famous, use this research when building their AGI. Maybe any such group would be operationally inadequate anyway, but I'm not sure. More generally, it's possible that in a world where alignment research is a well-known respectable field of study, more people take AI risk seriously.

...I do expect there to be at least some steps which need a fairly large alignment community doing "normal" (i.e. paradigmatic) incremental research. For instance, on some paths we need lots of people doing incremental interpretability/ontology research to link up lots of concepts to their representations in a trained system. On the other hand, not all of the foundations need to be very widespread - e.g. in the case of incremental interpretability/ontology research, it's mostly the interpretability tools which need memetic reach, not e.g. theory around grain-of-truth or value drift.

I think I have a somewhat different model of the alignment knowledge tree. From my perspective, the research I'm doing is already paradigmatic. I have a solid-enough paradigm, inside which there are many open problems, and what we need is a bunch of people chipping away at these open problems. Admittedly, the size of this "bunch" is still closer to 10 people than to 1000 people but (i) it's possible that the open problems will keep multiplying hydra-style, as often happens in math and (ii) memetic fitness would help getting the very best 10 people to do it.

It's also likely that there will be a "phase II" where the nature of the necessary research becomes very different (e.g. it might involve combining the new theory with neuroscience, or experimental ML research, or hardware engineering), and successful transition to this phase might require getting a lot of new people on board which would also be a lot easier given memetic fitness.

Principles of Privacy for Alignment Research

[For the record, here's previous relevant discussion]

My problem with the "nobody cares" model is that it seems self-defeating. First, if nobody cares about my work, then how would my work help with alignment? I don't put a lot of stock into building aligned AGI in the basement on the my own. (And not only because I don't have a basement.) Therefore, any impact I will have flows through my work becoming sufficiently known that somebody who builds AGI ends up using it. Even if I optimistically assume that I will personally be part of that project, my work needs to be sufficiently well-known to attract money and talent to make such a project possible.

Second, I also don't put a lot of stock into solving alignment all by myself. Therefore, other people need to build on my work. In theory, this only requires it to be well-known in the alignment community. But, to improve our chances of solving the problem we need to make the alignment community bigger. We want to attract more talent, much of which is found in the broader computer science community. This is in direct opposition to preserving the conditions for "nobody cares".

Third, a lot of people are motivated by fame and status (myself included). Therefore, bringing talent into alignment requires the fame and status to be achievable inside the field. This is obviously also in contradiction with "nobody cares".

My own thinking about this is: yes, progress in the problems I'm working on can contribute to capability research, but the overall chance of success on the pathway "capability advances driven by theoretical insights" is higher than on the pathway "capability advances driven by trial and error", even if the first leads to AGI sooner, especially if these theoretical insights are also useful for alignment. I certainly don't want to encourage the use of my work to advance capability, and I try to discourage anyone who would listen, but I accept the inevitable risk of that happening in exchange for the benefits.

Then again, I'm by no means confident that I'm thinking about all of this in the right way.

Human values & biases are inaccessible to the genome

I think that "directly specified" is just an ill-defined concept. You can ask whether A specifies B using encoding C. But if you don't fix C? Then any A can be said to "specify" any B (you can always put the information into C). Algorithmic information theory might come to the rescue by rephrasing the question as: "what is the relative Kolmogorov complexity K(B|A)?" Here, however, we have more ground to stand on, namely there is some function where is the space of genomes, is the space of environments and is the space of brains. Also we might be interested in a particular property of the brain, which we can think of as a function , for example might be something about values and/or biases. We can then ask e.g. how much mutual information is there between and vs. between and . Or, we can ask what is more difficult: changing by changing or by changing . Where the amount of "difficulty" can be measured by e.g. what fraction of inputs produce the desired output.

So, there are certainly questions that can be asked about, what information comes from the genome and what information comes from the environment. I'm not sure whether this is what you're going for, or you imagine some notion of information that comes from neither (but I have no idea what would that mean)? In any case, I think your thesis would benefit if you specified it more precisely. Given such a specification, it would be possible to assess the evidence more carefully.

Load More