All of Vanessa Kosoy's Comments + Replies

I don't think the argument on hacking relied on the ability to formally verify systems. Formally verified systems could potentially skew the balance of power to the defender side, but even if they don't exist, I don't think balance is completely skewed to the attacker.

My point was not about the defender/attacker balance. My point was that even short-term goals can be difficult to specify, which undermines the notion that we can easily empower ourselves by short-term AI.

Of course we need to understand how to define "long term" and "short term" here. O

... (read more)

Thanks for the responses Boaz!

Our claim is that one can separate out components - there is the predictable component which is non-stationary, but is best approximated with a relatively simple baseline, and the chaotic component, which over the long run is just noise. In general, highly complex rules are more sensitive to noise (in fact, there are theorems along these lines in the field of Analysis of Boolean Functions), and so in the long run, the simpler component will dominate the accuracy.

I will look into analysis of boolean functions, thank you. How... (read more)

boazbarak (5d)
Hi Vanessa,

Perhaps given my short-term preference, it's not surprising that I find it hard to track very deep comment threads, but let me just give a couple of short responses.

I don't think the argument on hacking relied on the ability to formally verify systems. Formally verified systems could potentially skew the balance of power to the defender side, but even if they don't exist, I don't think balance is completely skewed to the attacker. You could imagine that, like today, there is a "cat and mouse" game, where both attackers and defenders try to find "zero day vulnerabilities" and either exploit them (in one case) or fix them (in the other). I believe that in the world of powerful AI, this game would continue, with both sides having access to AI tools, which would empower both but not necessarily shift the balance to one or the other.

I think the question of whether a long-term planning agent could emerge from short-term training is a very interesting technical question! Of course we need to understand how to define "long term" and "short term" here. One way to think about this is the following: we can define various short-term metrics, which are evaluable using information in the short term, and potentially correlated with long-term success. We would say that a strategy is purely long-term if it cannot be explained by making advances on any combination of these metrics.

IIUC the thesis of this article rests on several interrelated claims:

  1. Long-term planning is not useful because of chaos
  2. Short-term AIs have no alignment problem
  3. Among humans, skill is not important for leadership, beyond some point
  4. Human brains have an advantage w.r.t. animals because of "universality", and any further advantage can only come from scaling with resources.

I wish to address these claims one by one.

Claim 1

This is an erroneous application of chaos theory IMO. The core observation of chaos theory is that in many dynamical systems with compa... (read more)

boazbarak (7d)
Hi Vanessa,

Let me try to respond (note the claim numbers below are not the same as in the essay, but rather as in Vanessa's comment):

Claim 1: Our claim is that one can separate out components - there is the predictable component which is non-stationary, but is best approximated with a relatively simple baseline, and the chaotic component, which over the long run is just noise. In general, highly complex rules are more sensitive to noise (in fact, there are theorems along these lines in the field of Analysis of Boolean Functions [https://arxiv.org/abs/2105.10386]), and so in the long run, the simpler component will dominate the accuracy.

Claim 2: Hacking is actually a fairly well-specified endeavor. People catalog, score, and classify security vulnerabilities [https://cve.mitre.org/]. To hack would be to come up with a security vulnerability, and exploit code, which can be verified. Also, you seem to be envisioning a long-term AI that is then fine-tuned on a short-term task, but how did it evolve these long-term goals in the first place?

Claim 3: I would not say that there is no such thing as talent in being a CEO or president. I do however believe that the best leaders have been some combination of their particular characteristics and talents, and the situation they were in. Steve Jobs led Apple to become the largest company in the world, but it is not clear that he is a "universal CEO" who would have done as well in any company (indeed he failed with NeXT). Similarly, Abraham Lincoln is typically ranked as the best U.S. president by historians, but again I think most would agree that he fit well the challenge that he had to face, rather than being someone who would have handled the Cold War or the 1970s energy crisis just as well. Also, as Yafah points out elsewhere here, for people to actually trust an AI with being the leader of a company or a country, it would need to not just be as good as humans or a little better, but better by a huge margin. In f
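For reference, here is the standard noise-stability identity from the analysis of Boolean functions that this argument gestures at (textbook notation, not necessarily the exact theorem of the linked paper):

```latex
% f : \{-1,1\}^n \to \mathbb{R} with Fourier expansion f = \sum_S \hat{f}(S)\,\chi_S;
% y \sim N_\rho(x) means each coordinate of x is kept with correlation \rho.
\mathrm{Stab}_\rho[f] \;=\; \mathop{\mathbb{E}}_{x,\; y \sim N_\rho(x)}\big[f(x)\,f(y)\big]
\;=\; \sum_{S \subseteq [n]} \rho^{|S|}\,\hat{f}(S)^2 .
```

Each degree-|S| component of f is damped by a factor ρ^{|S|}, so the predictive contribution of "complex" (high-degree) structure decays geometrically with the noise level, while the low-degree component survives; this is the formal sense in which the simpler component dominates over the long run.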

Even if we did make a goal program, it's still unknown how to build an AGI that is motivated to compute it, or to follow the goals it outputs.

Actually, it is (to a 0th approximation) known how to build an AGI that is motivated to compute it: use infra-Bayesian physicalism. The loss function in IBP already has the semantics "which programs should run". Following the goal it outputs is also formalizable within IBP, but even without this step we can just have utopia inside the goal program itself[1].


  1. We should be careful to prevent the inhabitants of th

... (read more)

P.S.

I think that in your example, if a person is given a button that can save a person on a different planet from being tortured, they will have a direct incentive to press the button, because the button is a causal connection in itself, and consciously reasoning about the person on the other planet is a causal[1] connection in the other direction. That said, a person still has a limited budget of such causal connections (you cannot reason about a group of arbitrarily many people, with fixed non-zero amount of paying attention to the individual details of ... (read more)

I'm curious what evidence you see that this is false as a description of the values of just about every human, given that

  • I, a human [citation needed], tell you that this seems to be a description of my values.
  • Almost every culture that ever existed had norms that prioritized helping family, friends and neighbors over helping random strangers, not to mention strangers that you never met.
  • Most people don't do much to help random strangers they never met, with the notable exception of effective altruists, but even most effective altruists only go that
... (read more)

First, you can consider preferences that are impartial but sublinear in the number of people. So, you can disagree with Nate's room analogy without the premise "stuff only matters if it adds to my own life and experiences".

Second, my preferences are indeed partial. But even that doesn't mean "stuff only matters if it adds to my own life and experiences". I do think that stuff only matters (to me) if it's in some sense causally connected to my life and experiences. More details here.

Third, I don't know what you mean by "good". The questions that I unders... (read more)

Rob Bensinger (1mo)
Yeah, I'm also talking about question 1. Seems obviously false as a description of my values (and, I'd guess, just about every human's). Consider the simple example of a universe that consists of two planets: mine, and another person's. We don't have spaceships, so we can't interact. I am not therefore indifferent to whether the other person is being horribly tortured for thousands of years. If I spontaneously consider the hypothetical, I will very strongly prefer that my neighbor not be tortured. If we add the claims that I can't affect it and can't ever know about it, I don't suddenly go "Oh, never mind, fuck that guy". Stuff that happens to other people is real, even if I don't interact with it.

and, i'd guess that one big universe is more than twice as Fun as two small universes, so even if there were no transaction costs it wouldn't be worth it. (humans can have more fun when there's two people in the same room, than one person each in two separate rooms.)

This sounds astronomically wrong to me. I think that my personal utility function gets close to saturation with a tiny fraction of the resources in the universe-shard. Two people in one room is better than two people in separate rooms, yes. But, two rooms with a trillion people each is virtually t... (read more)
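A toy model (entirely mine, purely illustrative) of the saturation claim made here and completed in the quote below: take an individual's utility from having n people available to interact with to be

```latex
U(n) \;=\; U_{\max}\left(1 - e^{-n/D}\right), \qquad D \approx \text{Dunbar number} \sim 150 .
```

Then the gain from merging two rooms of 10^12 people into one room of 2·10^12 is U(2·10^12) − U(10^12) < U_max · e^{−10^12/D}, which is astronomically small, so the two configurations are "virtually the same" under any such sharply saturating utility. This only illustrates the shape of the claim, not Vanessa's actual utility function.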

But, two rooms with a trillion people each is virtually the same as one room with two trillion. The returns on interactions with additional people fall off exponentially past the Dunbar number.

You're conflating "would I enjoy interacting with X?" with "is it good for X to exist?". Which is almost understandable given that Nate used the "two people can have more fun in the same room" example to illustrate why utility isn't linear in population. But this comment has an IMO bizarre amount of agreekarma (26 net agreement, with 11 votes), which makes me wonder if... (read more)

There's also the ALTER prize for progress on the learning-theoretic agenda.

Yes, absolutely! The contest is not a publication venue.

A major impediment in applying RL theory to any realistic scenario is that even the control problem[1] is intractable when the state space is exponentially large (in general). Real-life agents probably overcome this problem by exploiting some special properties of real-life environments. Here are two strong candidates for such properties:

  • In real life, processes can often be modeled as made of independent co-existing parts. For example, if I need to decide on my exercise routine for the next month and also on my research goals for the next month, the two
... (read more)
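A minimal formal sketch (my notation; standard MDP facts) of why the first property above helps with the exponentially large state space:

```latex
% If the environment splits into independent parts with additive reward:
S = S_1 \times S_2,\quad A = A_1 \times A_2,\quad
T(s' \mid s, a) = T_1(s_1' \mid s_1, a_1)\, T_2(s_2' \mid s_2, a_2),\quad
r(s,a) = r_1(s_1,a_1) + r_2(s_2,a_2),
% then the optimal value and an optimal policy decompose:
V^*(s_1, s_2) = V_1^*(s_1) + V_2^*(s_2), \qquad
\pi^*(s_1, s_2) = \big(\pi_1^*(s_1),\, \pi_2^*(s_2)\big).
```

So the exercise-routine decision and the research-goals decision can be planned separately, and the cost of planning scales with |S_1| + |S_2| rather than |S_1| · |S_2|; exploiting this kind of structure is one way real-life agents could evade the generic intractability.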

A question that often comes up in discussion of IRL: are agency and values purely behavioral concepts, or do they depend on how the system produces its behavior? The cartesian measure of agency I proposed seems purely behavioral, since it only depends on the policy. The physicalist version seems less so since it depends on the source code, but this difference might be minor: this role of the source is merely telling the agent "where" it is in the universe. However, on closer examination, the physicalist is far from purely behaviorist, and this is true e... (read more)

The spectrum you're describing is related, I think, to the spectrum that appears in the AIT definition of agency where there is dependence on the cost of computational resources. This means that the same system can appear agentic from a resource-scarce perspective but non-agentic from a resource-abundant perspective. The former then corresponds to the Vingean regime and the latter to the predictable regime. However, the framework does have a notion of prior and not just utility, so it is possible to ascribe beliefs to Vingean agents. I think it makes sense... (read more)

Causality in IBP

There seems to be an even more elegant way to define causal relationships between agents, or more generally between programs. Starting from a hypothesis Θ ∈ □(Γ×Φ), for Γ = Σ^R, we consider its bridge transform B ∈ □(Γ×2^Γ×Φ). Given some subset of programs Q ⊆ R we can define Δ := Σ^Q then project B to B_Δ ∈ □(Γ×2^Δ)[1]. We can then take the bridge transform again to get some C ∈ □(Γ×2^Γ×2^Δ). The 2^Γ factor now tells us which programs causally affect the manifestation of programs in Q. Notice that by Proposition 2.8 in the IBP article, when Q = R we just get all pro... (read more)

The problem of future unaligned AI leaking into human imitation is something I wrote about before. Notice that IDA-style recursion helps a lot, because instead of simulating a process going deep into the external timeline's future, you're simulating a "groundhog day" where the researcher wakes up over and over at the same external time (more realistically, the restart time is drifting forward with the time outside the simulation) with a written record of all their previous work (but no memory of it). There can still be a problem if there is a positive proba... (read more)

I think it's a terrible idea to automatically adopt an equilibrium notion which incentivises the players to come up with increasingly nasty threats as fallback if they don't get their way. And so there seems to be a good chunk of remaining work to be done, involving poking more carefully at the CoCo value and seeing which assumptions going into it can be broken.

I'm not convinced there is any real problem here. The intuitive negative reaction we have to this "ugliness" is because of (i) empathy and (ii) morality. Empathy is just a part of the utility fun... (read more)

This is a fascinating result, but there is a caveat worth noting. When we say that e.g. AlphaGo is "superhuman at go" we are comparing it to humans who (i) spent years training on the task and (ii) were selected for being the best at it among a sizable population. On the other hand, with next token prediction we're nowhere near that amount of optimization on the human side. (That said, I also agree that optimizing a model on next token prediction is very different from what optimizing it for text coherence would be, if we could accomplish the latter.)

Buck Shlegeris (4mo)
Yeah, I agree that it would be kind of interesting to see how good humans would get at this if it was a competitive sport. I still think my guess is that the best humans would be worse than GPT-3, and I'm unsure if they're worse than GPT-2. (There's no limit on anyone spending a bunch of time practicing this game, if for some reason someone gets really into it I'd enjoy hearing about the results.)

The short answer is, I don't know.

The long answer is, here are some possibilities, roughly ordered from "boring" to "weird":

  1. The framework is wrong.
  2. The framework is incomplete, there is some extension which gets rid of monotonicity. There are some obvious ways to make such extensions, but they look uglier and without further research it's hard to say whether they break important things or not.
  3. Humans are just not physicalist agents, you're not supposed to model them using this framework, even if this framework can be useful for AI. This is why humans too
... (read more)

The problem is that if Θ implies that H creates G, but you consider a counterfactual in which H doesn't create G, then you get an inconsistent hypothesis i.e. a HUC which contains only 0. It is not clear what to do with that. In other words, the usual way of defining counterfactuals in IB (I tentatively named it "hard counterfactuals") only makes sense when the condition you're counterfactualizing on is something you have Knightian uncertainty about (which seems safe to assume if this condition is about your own future action but not safe to assume in genera... (read more)

it would be the best possible model of this type, at the task of language modeling on data sampled from the same distribution as MassiveText

Transformers are Turing complete, so "model of this type" is not much of a constraint. On the other hand, I guess it's theoretically possible that some weight matrices are inaccessible to current training algorithms no matter how much compute and data we have. It also seems possible that the scaling law doesn't go on forever, but phase-transitions somewhere (maybe very far) to a new trend which goes below the "irreducible" term.
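For concreteness, here is the functional form usually meant by "a scaling law with an irreducible term", and a toy fit of it (synthetic numbers of my own, purely illustrative; the point in the comment is that nothing forces the true curve to keep this shape forever):

```python
import numpy as np
from scipy.optimize import curve_fit

# Standard scaling-law ansatz: loss = irreducible floor + power law in model size.
def scaling_law(n, l_inf, a, b):
    return l_inf + a * n ** (-b)

# Toy synthetic losses (made-up constants, not real measurements).
n = np.logspace(6, 12, 20)                                   # e.g. parameter counts
loss = scaling_law(n, 1.7, 400.0, 0.3)
loss = loss + np.random.default_rng(0).normal(0.0, 0.01, n.shape)

# Fit the three constants; l_inf is the apparent "irreducible" loss.
(l_inf, a, b), _ = curve_fit(scaling_law, n, loss, p0=[1.0, 100.0, 0.5], maxfev=20000)
print(f"fitted irreducible term: {l_inf:.2f}")

# A later "phase transition" to a different trend could take the true loss
# below l_inf; the fitted floor only summarizes the regime we extrapolated from.
```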

Master post for ideas about infra-Bayesian physicalism.

Other relevant posts:

Telephone Theorem, Redundancy/Resampling, and Maxent for the math, Chaos for the concepts.

Thank you!

Just because something can be learned efficiently doesn't mean it's convergent for a wide variety of cognitive systems.

I believe that the relevant cognitive systems all look like learning algorithms for a prior of a certain fairly specific type. I don't know what this prior looks like, but it's something very rich on the one hand and efficiently learnable on the other hand. So, if you showed that your formalism naturally produces priors that seem closer ... (read more)

As I see it, the core theory of natural abstractions is now 80% nailed down

Question 1: What's the minimal set of articles one should read to understand this 80%?

Question/Remark 2: AFAICT, your theory has a major missing piece, which is, proving that "abstraction" (formalized according to your way of formalizing it) is actually a crucial ingredient of learning/cognition. The way I see it, such a proof should be by demonstrating that hypothesis classes defined in terms of probabilistic graph models / abstraction hierarchies can be learned with good sa... (read more)

johnswentworth (4mo)
Telephone Theorem [https://www.lesswrong.com/posts/jJf4FrfiQdDGg7uco/the-telephone-theorem-information-at-a-distance-is-mediated], Redundancy/Resampling [https://www.lesswrong.com/posts/vvEebH5jEvxnJEvBC/abstractions-as-redundant-information], and Maxent [https://www.lesswrong.com/posts/cqdDGuTs2NamtEhBW/maxent-and-abstractions-current-best-arguments] for the math, Chaos [https://www.lesswrong.com/posts/zcCtQWQZwTzGmmteE/chaos-induces-abstractions] for the concepts.

If we want to show that abstraction is a crucial ingredient of learning/cognition, then "Can we efficiently learn hypothesis classes defined in terms of abstraction hierarchies, as captured by John's formalism?" is entirely the wrong question. Just because something can be learned efficiently doesn't mean it's convergent for a wide variety of cognitive systems. And even if such hypothesis classes couldn't be learned efficiently in full generality, it would still be possible for a subset of that hypothesis class to be convergent for a wide variety of cognitive systems, in which case general properties of the hypothesis class would still apply to those systems' cognition.

The question we actually want here is "Is abstraction, as captured by John's formalism, instrumentally convergent for a wide variety of cognitive systems?". And that question is indeed not yet definitively answered. The pragmascope itself would largely allow us to answer that question empirically, and I expect the ability to answer it empirically will quickly lead to proofs as well.

Our work doesn't necessarily need wide memetic spread to be found by the people who know what to look for. E.g. people playing through the alignment game tree are a lot more likely to realize that ontology identification, grain-of-truth, value drift, etc, are key questions to ask, whereas ML researchers just pushing toward AGI are a lot less likely to ask those questions.

That's a valid argument, but I can also imagine groups that (i) in a world where alignment research is obscure proceed to create unaligned AGI (ii) in a world where alignment research i... (read more)

[For the record, here's previous relevant discussion]

My problem with the "nobody cares" model is that it seems self-defeating. First, if nobody cares about my work, then how would my work help with alignment? I don't put a lot of stock into building aligned AGI in the basement on the my own. (And not only because I don't have a basement.) Therefore, any impact I will have flows through my work becoming sufficiently known that somebody who builds AGI ends up using it. Even if I optimistically assume that I will personally be part of that project, my work ne... (read more)

johnswentworth (4mo)
Our work doesn't necessarily need wide memetic spread to be found by the people who know what to look for. E.g. people playing through the alignment game tree are a lot more likely to realize that ontology identification, grain-of-truth, value drift, etc, are key questions to ask, whereas ML researchers just pushing toward AGI are a lot less likely to ask those questions.

I do agree that a growing alignment community will add memetic fitness to alignment work in general, which is at least somewhat problematic for the "nobody cares" model. And I do expect there to be at least some steps which need a fairly large alignment community doing "normal" (i.e. paradigmatic) incremental research. For instance, on some paths we need lots of people doing incremental interpretability/ontology research to link up lots of concepts to their representations in a trained system.

On the other hand, not all of the foundations need to be very widespread - e.g. in the case of incremental interpretability/ontology research, it's mostly the interpretability tools which need memetic reach, not e.g. theory around grain-of-truth or value drift.

I think that "directly specified" is just an ill-defined concept. You can ask whether A specifies B using encoding C. But if you don't fix C? Then any A can be said to "specify" any B (you can always put the information into C). Algorithmic information theory might come to the rescue by rephrasing the question as: "what is the relative Kolmogorov complexity K(B|A)?" Here, however, we have more ground to stand on, namely there is some function where is the space of genomes, is the space of environments and is the space of brains. Also we might... (read more)

Well, how do you define "directly specified"? If human brains reliably converge towards a certain algorithm, then effectively this algorithm is specified by the genome. The real question is, which parts depend only on genes and which parts depend on the environment. My tentative opinion is that the majority is in the genes, since humans are, broadly speaking, pretty similar to each other. One environmental effect is, feral humans grow up with serious mental problems. But, my guess is, this is not because of missing "values" or "biases", but (to 1st approxim... (read more)

Alex Turner (4mo)
I don't classify "convergently learned" as an instance of "directly specified", but rather "indirectly specified, in conjunction with the requisite environmental data."

Here's an example. I think that humans' reliably-learned edge detectors in V1 are not "directly specified", in the same way that vision models don't have directly specified curve detectors, but these detectors are convergently learned in order to do well on vision tasks.

If I say "sunk cost is directly specified", I mean something like "the genome specifies neural circuitry which will eventually, in situations where sunk cost arises, fire so as to influence decision-making." However, if, for example, the genome lays out the macrostructure of the connectome and the broad-scale learning process and some reward circuitry and regional learning hyperparameters and some other details, and then this brain eventually comes to implement a sunk-cost bias, I don't call that "direct specification."

I wish I had been more explicit about "direct specification", and perhaps this comment is still not clear. Please let me know if so!

My reasoning can be roughly described as:

  • There is a simple mathematical theory of agency, similarly to how there are simple mathematical theories of e.g. probability or computational complexity
  • This theory should include an explanation of how agents can have goals defined not in terms of sensory data
  • I have a current best guess to what the outline of this theory looks like, based on (i) simplicity (ii) satisfying natural-seeming desiderata and (iii) ability to prove relevant non-trivial theorems (for example, infra-Bayesian reinforcement learning theory is
... (read more)
Quintin Pope (5mo)
I'd note that it's possible for an organism to learn to behave (and think) in accordance with the "simple mathematical theory of agency" you're talking about, without said theory being directly specified by the genome. If the theory of agency really is computationally simple, then many learning processes probably converge towards implementing something like that theory, simply as a result of being optimized to act coherently in an environment over time.

I think the way it works is approximately as follows. There is a fixed "ontological" infra-POMDP which is a coarse hard-coded world-model sufficient to define the concepts on which the reward depends (for humans, it would include concepts such as "other humans"). Then there is a prior which is composed of refinements of this infra-POMDP. The reward depends on the state of the ontological IPOMDP, so it is allowed to depend on the concepts of the hard-coded world-model (but not on the concepts which only exist in the refined models). Ofc, this leaves open the qu... (read more)
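A toy sketch of this architecture (entirely my own construction; real infra-POMDPs and their refinements are measure-theoretic objects, which this does not implement): the reward is defined on the coarse, hard-coded ontology, and refined hypotheses only touch it through an abstraction map back down to that ontology.

```python
from dataclasses import dataclass
from typing import Callable, Hashable

# Coarse, hard-coded ontology: the only states the reward is allowed to see.
@dataclass(frozen=True)
class OntologicalModel:
    coarse_states: frozenset                 # e.g. {"human_present", "no_human"}
    reward: Callable[[Hashable], float]      # defined on coarse states only

# A refinement: a richer hypothesis plus a map sending its states down
# to the coarse ontology (the direction in which "refinement" points).
@dataclass(frozen=True)
class Refinement:
    fine_states: frozenset
    abstraction: Callable[[Hashable], Hashable]  # fine state -> coarse state

def reward_on_refinement(onto: OntologicalModel, ref: Refinement, fine_state) -> float:
    # The reward of a fine state is just the reward of its coarse image;
    # concepts that exist only in the refined model cannot carry reward.
    return onto.reward(ref.abstraction(fine_state))

# Tiny usage example with made-up states.
onto = OntologicalModel(
    coarse_states=frozenset({"human_present", "no_human"}),
    reward=lambda s: 1.0 if s == "human_present" else 0.0,
)
ref = Refinement(
    fine_states=frozenset({"alice_in_room", "bob_in_room", "empty_room"}),
    abstraction=lambda s: "no_human" if s == "empty_room" else "human_present",
)
print(reward_on_refinement(onto, ref, "alice_in_room"))  # 1.0
```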

Alex Turner (5mo)
Without knowing the details of infra-POMDPs or your other work, by what Bayesian evidence do you raise this particular hypothesis to consideration [https://www.readthesequences.com/Privileging-The-Hypothesis]? (I say this not to imply that you do not have such evidence, only that I do not presently see why I should consider this particular hypothesis.)

Here's a video of a talk I gave about PreDCA.

Something like Bayesian/expected utility maximization seems useful for understanding agents and agency. However, there is the problem that expected utility theory doesn’t seem to predict anything in particular. We want a better response to “Expected utility theory doesn’t predict anything” that can describe the insight of EU theory re what agents are without being misinterpreted / without failing to constrain expectations at all technically.

Agents are policies with a high value of g. So, "EU theory" does "predict" something, although it's a "soft" prediction (i.e. agency is a matter of degree).

When I say "policy", I mean the entire behavior including the learning algorithm, not some asymptotic behavior the system is converging to. Obviously, the policy is represented as genetic code, not as individual decisions. When I say "evolution is directly selecting the policy", I mean that genotypes are selected based on their "expected reward" (reproductive fitness) rather than e.g. by evaluating the accuracy of the world-models those minds produce[1]. And, genotypes are not a priori constrained to be learning algorithms with particular architectures, th... (read more)

The word "bounded" in "bounded simplicity prior" referred to bounded computational resources. A "bounded simplicity prior" is a prior which involves either a "hard" (i.e. some hypotheses are excluded) or a "soft" (i.e. some hypotheses are down-weighted) bound on computational resources (or both), and also inductive bias towards simplicity (specifically it should probably behave as ~ 2^{-description complexity}). For a concrete example, see the prior I described here (w/o any claim to originality).

2Richard Ngo5mo
Ah, I see. That makes sense now!

Not quite sure what you're saying here. Is the claim that speed penalties would help shift the balance against mesa-optimizers? These kinds of solutions are worth looking into, but I'm not too optimistic about them atm. First, the mesa-optimizer probably won't add a lot of overhead compared to the considerable complexity of emulating a brain. In particular, it need not work by anything like our own ML algorithms. So, if it's possible to rule out mesa-optimizers like this, it would require a rather extreme penalty. Second, there are limits on how much you can... (read more)

Richard Ngo (5mo)
No, I wasn't advocating adding a speed penalty, I was just pointing at a reason to think that a speed prior would give a more accurate answer to the question of "which is favored" than the bounded simplicity prior you're assuming: But now I realise that I don't understand why you think this is true of transformers. Could you explain? It seems to me that there are many very simple hypotheses which take a long time to calculate, and which transformers therefore can't be representing.

Epistemic status: some of these ideas only crystallized today, normally I would take at least a few days to process before posting to make sure there are no glaring holes in the reasoning, but I saw this thread and decided to reply since it's topical.

Suppose that your imitator works by something akin to Bayesian inference with some sort of bounded simplicity prior (I think it's true of transformers). In order for Bayesian inference to converge to exact imitation, you usually need realizability. Obviously today we don't have realizability because the ANNs c... (read more)

Richard Ngo (5mo)
In a deep learning context, the latter hypothesis seems much more heavily favored when using a simplicity prior (since gradient descent is simple to specify) than a speed prior (since gradient descent takes a lot of computation). So as long as the compute costs of inference remain smaller than the compute costs of training, a speed prior seems more appropriate for evaluating how easily hypotheses can become more epistemically sophisticated than the outer loop.

I want to outline how my research programme attempts to address this core difficulty.

First, like I noted before, evolution is not a perfect analogy for AI. This is because evolution is directly selecting the policy, whereas a (model-based) AI system is separately selecting (i) a world-model (ii) a reward function and (iii) a plan (policy) based on i+ii. This inherently produces better generalization-of-alignment (but not nearly enough to solve the problem).

With iii, we have the fewest generalization problems, because we are not limited by training data: the... (read more)

Alex Turner (5mo)
Huh? Evolution did not directly select over human policy decisions. Evolution specified brains, which do within-lifetime learning and therefore learn different policies given different upbringings, and e.g. learning rate mutations indirectly lead to statistical differences in human learned policies. Evolution probably specifies some reward circuitry, the learning architecture, the broad-strokes learning processes (self-supervised predictive + RL), and some other factors, from which the policy is produced. The IGF->human values analogy is indeed relevantly misleading IMO, but not for this reason.

My 0th approximation answer is: you're describing something logically incoherent, like a p-zombie.

My 1st approximation answer is more nuanced. Words that, in the pre-Turing era, referred exclusively to humans (and sometimes animals, and fictional beings), such as "wants", "experiences" et cetera, might have two different referents. One referent is a natural concept, something tied into deep truths about how the universe (or multiverse) works. In particular, deep truths about the "relatively simple core structure that explains why complicated cognitive mach... (read more)

Infra-Bayesian physicalism is an interesting example in favor of the thesis that the more qualitatively capable an agent is, the less corrigible it is (a.k.a. "corrigibility is anti-natural to consequentialist reasoning"). Specifically, alignment protocols that don't rely on value learning become vastly less safe when combined with IBP:

  • Example 1: Using steep time discount to disincentivize dangerous long-term plans. For IBP, "steep time discount" just means, predominantly caring about your source code running with particular short inputs. Such a goal s

... (read more)

Second, the only reason why the question "what X wants" can make sense at all, is because X is an agent. As a corollary, it only makes sense to the extent that X is an agent.

I'm not sure this is true; or if it's true, I'm not sure it's relevant.

If we go down that path then it becomes the sort of conversation where I have no idea what common assumptions we have, if any, that we could use to agree. As a general rule, I find it unconstructive, for the purpose of trying to agree on anything, to say things like "this (intuitively compelling) assumpti... (read more)

Rob Bensinger (6mo)
Fair enough! I don't think I agree in general, but I think 'OK, but what's your alternative to agency?' is an especially good case for this heuristic. The first counter-example that popped into my head was "a mind that lacks any machinery for considering, evaluating, or selecting actions; but it does have machinery for experiencing more-pleasurable vs. less pleasurable states". This is a mind we should be able to build, even if it would never evolve naturally. Possibly this still qualifies as an "agent" that "wants" and "pursues" things, as you conceive it, even though it doesn't select actions?

Humans are at least a little coherent, or we would never get anything done; but we aren't very coherent, so the project of piecing together 'what does the human brain as a whole "want"' can be vastly more difficult than the problem of figuring out what a coherent optimizer wants.

This is a point where I feel like I do have a substantial disagreement with the "conventional wisdom" of LessWrong.

First, LessWrong began with a discussion of cognitive biases in human irrationality, so this naturally became a staple of the local narrative. On the other hand, I ... (read more)

Second, the only reason why the question "what X wants" can make sense at all, is because X is an agent. As a corollary, it only makes sense to the extent that X is an agent.

I'm not sure this is true; or if it's true, I'm not sure it's relevant. But assuming it is true...

Therefore, if X is not entirely coherent then X's preferences are only approximately defined, and hence we only need to infer them approximately.

... this strikes me as not capturing the aspect of human values that looks strange and complicated. Two ways I could imagine the strangeness and ... (read more)

First, some remarks about the meta-level:

The ability to do new basic work noticing and fixing those flaws is the same ability as the ability to write this document before I published it, which nobody apparently did, despite my having had other things to do than write this up for the last five years or so. Some of that silence may, possibly, optimistically, be due to nobody else in this field having the ability to write things comprehensibly - such that somebody out there had the knowledge to write all of this themselves, if they could only have written

... (read more)

There is a big chunk of what you're trying to teach which is not weird and complicated, namely: "find this other agent, and what their values are". Because "agents" and "values" are natural concepts, for reasons strongly related to "there's a relatively simple core structure that explains why complicated cognitive machines work".

This seems like it must be true to some degree, but "there is a big chunk" feels a bit too strong to me.

Possibly we don't disagree, and just have different notions of what a "big chunk" is. But some things that make the chunk feel sm... (read more)

Maybe corrigibility / task AGI / etc. is harder than CEV, but it just doesn't seem realistic to me to try to achieve full, up-and-running CEV with the very first AGI systems you build, within a few months or a few years of humanity figuring out how to build AGI at all.

The way I imagine the win scenario is, we're going to make a lot of progress in understanding alignment before we know how to build AGI. And, we're going to do it by prioritizing understanding alignment modulo capability (the two are not really possible to cleanly separate, but it might be... (read more)

Rob Bensinger (6mo)
The latter, as I was imagining "95%".

I agree that it's a tricky problem, but I think it's probably tractable. The way PreDCA tries to deal with these difficulties is:

  • The AI can tell that, even before the AI was turned on, the physical universe was running certain programs.
  • Some of those programs are "agentic" programs.
  • Agentic programs have approximately well-defined utility functions.
  • Disassembling the humans doesn't change anything, since it doesn't affect the programs that were already running[1] before the AI was turned on.
  • Since we're looking at agent-programs rather than specific agen
... (read more)

Before humanity gets to steps 1-2 ('use CEV or something to make the long-term future awesome'), it needs to get past steps 3-6 ('use limited task AGI to ensure that humanity doesn't kill itself with AGI so we can proceed to take our time with far harder problems like "what even is CEV" and "how even in principle would one get an AI system to robustly do anything remotely like that, without some subtle or not-so-subtle disaster resulting"').

I want to register my skepticism about this claim. Whereas it might naively seem that "put a strawberry on a plate... (read more)

[comment deleted] (6mo)

And if humans had a utility function and we knew what that utility function was, we would not need CEV.  Unfortunately extracting human preferences over out-of-distribution options and outcomes at dangerously high intelligence, using data gathered at safe levels of intelligence and a correspondingly narrower range of outcomes and options, when there exists no sensory ground truth about what humans want because human raters can be fooled or disassembled, seems pretty complicated.  There is ultimately a rescuable truth about what we want, and CEV i... (read more)

Rob Bensinger (6mo)
Yeah, I'm very interested in hearing counter-arguments to claims like this. I'll say that although I think task AGI is easier, it's not necessarily strictly easier, for the reason you mentioned. Maybe a cruxier way of putting my claim is:

Maybe corrigibility / task AGI / etc. is harder than CEV, but it just doesn't seem realistic to me to try to achieve full, up-and-running CEV with the very first AGI systems you build, within a few months or a few years of humanity figuring out how to build AGI at all. And I do think you need to get CEV up and running within a few months or a few years, if you want to both (1) avoid someone else destroying the world first, and (2) not use a "strawberry-aligned" AGI to prevent 1 from happening.

All of the options are to some extent a gamble, but corrigibility, task AGI, limited impact, etc. strike me as gambles that could actually realistically work out well for humanity even under extreme time pressure to deploy a system within a year or two of 'we figure out how to build AGI'. I don't think CEV is possible under that constraint. (And rushing CEV and getting it only 95% correct poses far larger s-risks than rushing low-impact non-operator-modeling strawberry AGI and getting it only 95% correct.)

Here's a question inspired by thinking about Turing RL, and trying to understand what kind of "beliefs about computations" should we expect the agent to acquire.

Does mathematics have finite information content?

First, let's focus on computable mathematics. At first glance, the answer seems obviously "no": because of the halting problem, there's no algorithm (i.e. a Turing machine that always terminates) which can predict the result of every computation. Therefore, you can keep learning new facts about results of computations forever. BUT, maybe most of thos... (read more)
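One standard data point for the "finite core" direction (a known fact about algorithmic information, not a claim from the comment):

```latex
% Chaitin's constant for a fixed prefix-free universal machine U:
\Omega_U \;=\; \sum_{p \,:\, U(p)\ \text{halts}} 2^{-|p|}.
```

The first n bits of Ω_U suffice to decide the halting of every program of length at most n, so in this non-effective sense all halting facts compress into a single (algorithmically random) real number; the harder part of the question is what survives once resource bounds are imposed on the learner.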

Alex Mennen (7mo)
Wikipedia claims [https://en.wikipedia.org/wiki/Algorithmically_random_sequence#Properties_and_examples_of_Martin-L%C3%B6f_random_sequences] that every sequence is Turing reducible to a random one, giving a positive answer to the non-resource-bounded version of any question of this form. There might be a resource-bounded version of this result as well, but I'm not sure.

Two more remarks.

User Detection

It can be useful to identify and assist specifically the user rather than e.g. any human that ever lived (and maybe some hominids). For this purpose I propose the following method. It also strengthens the protocol by relieving some pressure from other classification criteria.

Given two agents G and H, we can ask which points on G's timeline are in the causal past of which points of H's timeline. To answer this, consider the counterfactual in which G takes a random action (or sequence of actions) at some point (or interval... (read more)
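A compact restatement of the test being proposed (my formalization of the construction described above):

```latex
% Counterfactual mutual information between G's randomized action at time s
% and H's observations over a window starting at time t:
I_{s \to t}(G, H) \;=\; I\big(A_G(s)\,;\,O_H([t, t+\delta])\big),
\quad A_G(s)\ \text{replaced by a uniformly random action.}
```

H is in G's future causal cone from time s iff I_{s→t}(G, H) > 0 for some t (and symmetrically for the past cone); the user is then, to first approximation, the precursor agent that enters the intersection of the AI's future cone from its origin and its past cone from time t at the smallest t, refined by demanding a high mutual-information threshold.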

Vanessa Kosoy (3mo)
Causality in IBP

There seems to be an even more elegant way to define causal relationships between agents, or more generally between programs. Starting from a hypothesis Θ ∈ □(Γ×Φ), for Γ = Σ^R, we consider its bridge transform B ∈ □(Γ×2^Γ×Φ). Given some subset of programs Q ⊆ R we can define Δ := Σ^Q then project B to B_Δ ∈ □(Γ×2^Δ)[1]. We can then take the bridge transform again to get some C ∈ □(Γ×2^Γ×2^Δ). The 2^Γ factor now tells us which programs causally affect the manifestation of programs in Q. Notice that by Proposition 2.8 in the IBP article, when Q = R we just get all programs that are running, which makes sense.

Agreement Rules Out Mesa-Optimization

The version of PreDCA without any explicit malign hypothesis filtering might be immune to malign hypotheses, and here is why. It seems plausible that IBP admits an agreement theorem (analogous to Aumann's) which informally amounts to the following: Given two agents Alice and Bobcat that (i) share the same physical universe, (ii) have a sufficiently tight causal relationship (each can see what the other sees), (iii) have unprivileged locations inside the physical universe, (iv) start from similar/compatible priors and (v) [maybe needed?] similar utility functions, they converge to similar/compatible beliefs, regardless of the complexity of translation between their subjective viewpoints. This is plausible because (i) as opposed to the cartesian framework, different bridge rules don't lead to different probabilities and (ii) if Bobcat considers a simulation hypothesis plausible, and the simulation is sufficiently detailed to fool it indefinitely, then the simulation contains a detailed simulation of Alice and hence Alice must also consider this to be a plausible hypothesis. If the agreement conjecture is true, then the AI will converge to hypotheses that all contain the user, in a causal relationship with the AI that affirms them as the user. Moreover, those hypotheses will be compatible with the user's own po

All credit for this beautiful work goes to Alex.

Some additional thoughts.

Non-Cartesian Daemons

These are notoriously difficult to deal with. The only methods I know of that are applicable to other protocols are homomorphic cryptography and quantilization of envelope (external computer) actions. But, in this protocol, they are dealt with the same as Cartesian daemons! At least if we assume a non-Cartesian attack requires an envelope action, the malign hypotheses which are would-be sources of such actions are discarded without giving an opportunity for attack.

Weaknesses

My main concerns with this approach ... (read more)

Precursor Detection, Classification and Assistance (PreDCA)

Infra-Bayesian physicalism provides us with two key building blocks:

  • Given a hypothesis about the universe, we can tell which programs are running. (This is just the bridge transform.)
  • Given a program, we can tell whether it is an agent, and if so, which utility function it has[1] (the "evaluating agent" section of the article).

I will now outline how we can use these building blocks to solve both the inner and outer alignment problem. The rough idea is:

  • For each hypothesis in the prior, check
... (read more)
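A schematic sketch of that outline (every function below is a stub I made up to illustrate the shape of the pipeline; none of it is a real API, and the actual protocol defines these objects mathematically via the bridge transform and the agency measure g):

```python
AGENCY_THRESHOLD = 0.5

def bridge_transform(hypothesis):        # stub: programs the hypothesis says are running
    return hypothesis["running_programs"]

def agency_measure(program):             # stub: stand-in for the agency score g
    return program.get("g", 0.0)

def is_precursor(program, hypothesis):   # stub: could this agent have prevented the AI's creation?
    return program.get("precursor", False)

def utility_of(program):                 # stub: utility function ascribed to the agent
    return program["utility"]

def predca_utility_pieces(prior):        # prior: list of (weight, hypothesis) pairs
    pieces = []
    for weight, hypothesis in prior:
        programs = bridge_transform(hypothesis)
        agents = [p for p in programs if agency_measure(p) > AGENCY_THRESHOLD]
        precursors = [a for a in agents if is_precursor(a, hypothesis)]
        # The real protocol also singles out the user among the precursors and
        # filters malign simulation hypotheses; both steps are elided here.
        pieces += [(weight, utility_of(a)) for a in precursors]
    return pieces  # to be aggregated into the AI's effective goal

# Toy usage with made-up data:
toy_prior = [(1.0, {"running_programs": [
    {"g": 0.9, "precursor": True, "utility": "user_utility"},
    {"g": 0.1, "precursor": False, "utility": "rock"},
]})]
print(predca_utility_pieces(toy_prior))  # [(1.0, 'user_utility')]
```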
Vanessa Kosoy (3mo)
A question that often comes up in discussion of IRL: are agency and values purely behavioral concepts, or do they depend on how the system produces its behavior? The cartesian measure of agency I proposed [https://www.lesswrong.com/posts/dPmmuaz9szk26BkmD/shortform?commentId=ovBmi2QFikE6CRWtj] seems purely behavioral, since it only depends on the policy. The physicalist version seems less so since it depends on the source code, but this difference might be minor: this role of the source is merely telling the agent "where" it is in the universe.

However, on closer examination, the physicalist g is far from purely behaviorist, and this is true even for cartesian Turing RL. Indeed, the policy describes not only the agent's interaction with the actual environment but also its interaction with the "envelope" computer. In a sense, the policy can be said to reflect the agent's "conscious thoughts". This means that specifying an agent requires not only specifying its source code but also the "envelope semantics" C (possibly we also need to penalize for the complexity of C in the definition of g). Identifying that an agent exists requires not only that its source code is running, but also, at least, that its history h is C-consistent with the α ∈ 2^Γ variable of the bridge transform. That is, for any y ∈ α we must have dCy for some destiny d ⊐ h. In other words, we want any computation the agent ostensibly runs on the envelope to be one that is physically manifest (it might be this condition isn't sufficiently strong, since it doesn't seem to establish a causal relation between the manifesting and the agent's observations, but it's at least necessary).

Notice also that the computational power of the envelope implied by C becomes another characteristic of the agent's intelligence, together with g as a function of the cost of computational resources. It might be useful to come up with natural ways to quantify this power.
ViktoriaMalyasova (4mo)
Can you please explain how this does not match the definition? I don't yet understand all the math, but intuitively, if H creates G / doesn't interfere with the creation of G, then if H instead followed policy "do not create G / do interfere with the creation of G", then G's code wouldn't run? Can you please give an example of a precursor that does match the definition?
Vanessa Kosoy (5mo)
Here's a video [https://www.youtube.com/watch?v=24vIJDBSNRI] of a talk I gave about PreDCA.
Vanessa Kosoy (7mo)
Two more remarks.

User Detection

It can be useful to identify and assist specifically the user rather than e.g. any human that ever lived (and maybe some hominids). For this purpose I propose the following method. It also strengthens the protocol by relieving some pressure from other classification criteria.

Given two agents G and H, we can ask which points on G's timeline are in the causal past of which points of H's timeline. To answer this, consider the counterfactual in which G takes a random action (or sequence of actions) at some point (or interval) on G's timeline, and measure the mutual information between this action(s) and H's observations at some interval on H's timeline.

Using this, we can effectively construct a future "causal cone" emanating from the AI's origin, and also a past causal cone emanating from some time t on the AI's timeline. Then, "nearby" agents will meet the intersection of these cones for low values of t whereas "faraway" agents will only meet it for high values of t or not at all. To first approximation, the user would be the "nearest" precursor[1] agent i.e. the one meeting the intersection for the minimal t. More precisely, we expect the user's observations to have nearly maximal mutual information with the AI's actions: the user can e.g. see every symbol the AI outputs to the display. However, the other direction is less clear: can the AI's sensors measure every nerve signal emanating from the user's brain? To address this, we can fix t to a value s.t. we expect only the user to meet the intersection of cones, and have the AI select the agent which meets this intersection for the highest mutual information threshold. This probably does not make the detection of malign agents redundant, since AFAICT a malign simulation hypothesis might be somehow cleverly arranged to make a malign agent the user.

More on Counterfactuals

In the parent post I suggested "instead of examining only Θ we also examine co
Vanessa Kosoy (8mo)
Some additional thoughts.

Non-Cartesian Daemons [HTTPS://WWW.LESSWRONG.COM/POSTS/5BD75CC58225BF0670375575/THE-LEARNING-THEORETIC-AI-ALIGNMENT-RESEARCH-AGENDA#TAMING_DAEMONS]

These are notoriously difficult to deal with. The only methods I know of that are applicable to other protocols are homomorphic cryptography and quantilization of envelope (external computer) actions. But, in this protocol, they are dealt with the same as Cartesian daemons! At least if we assume a non-Cartesian attack requires an envelope action, the malign hypotheses which are would-be sources of such actions are discarded without giving an opportunity for attack.

Weaknesses

My main concerns with this approach are:

  • The possibility of major conceptual holes in the definition of precursors. More informal analysis can help, but ultimately mathematical research in infra-Bayesian physicalism in general and infra-Bayesian cartesian/physicalist multi-agent [https://www.lesswrong.com/posts/dPmmuaz9szk26BkmD/shortform?commentId=uZ5xq73xmZSTSZN33] interactions in particular is required to gain sufficient confidence.
  • The feasibility of a good enough classifier. At present, I don't have a concrete plan for attacking this, as it requires inputs from outside of computer science.
  • Inherent "incorrigibility": once the AI becomes sufficiently confident that it correctly detected and classified its precursors, its plans won't defer to the users any more than the resulting utility function demands. On the second hand, I think the concept of corrigibility is underspecified [https://www.lesswrong.com/posts/dPmmuaz9szk26BkmD/shortform?commentId=5Rxgkzqr8XsBwcEQB] so much that I'm not sure it is solved (rather than dissolved) even in the Book [https://www.lesswrong.com/posts/34Gkqus9vusXRevR8/late-2021-miri-conversations-ama-discussion?commentId=PYHHJkHcS55ekmWEE]. Moreover, the concern can be ameliorated by sufficiently powerful interpretabi