Vanessa Kosoy

AI alignment researcher supported by HUJI, MIRI and LTFF. Working on the learning-theoretic agenda.

E-mail: vanessa DOT kosoy AT {the thing reverse stupidity is not} DOT org



The post is still largely up-to-date. In the intervening year, I mostly worked on the theory of regret bounds for infra-Bayesian bandits, and haven't made much progress on open problems in infra-Bayesian physicalism. On the other hand, I also haven't found any new problems with the framework.

The strongest objection to this formalism is the apparent contradiction between the monotonicity principle and the sort of preferences humans have. While my thinking about this problem evolved a little, I am still at a spot where every solution I know requires biting a strange philosophical bullet. On the other hand, IBP is still my best guess about naturalized induction, and, more generally, about the conjectured "attractor submanifold" in the space of minds, i.e. the type of mind to which all sufficiently advanced minds eventually converge.

One important development that did happen is my invention of the PreDCA alignment protocol, which critically depends on IBP. I consider PreDCA to be the most promising direction I know at present for solving alignment, and an important (informal) demonstration of the potential of the IBP formalism.

First, the notation makes no sense. The prior is over hypotheses, each of which is an element of . is the notation used to denote a single hypothesis.

Second, having a prior just over doesn't work since both the loss function and the counterfactuals depend on .

Third, the reason we don't just start with a prior over , is because it's important which prior we have. Arguably, the correct prior is the image of a simplicity prior over physicalist hypotheses by the bridge transform. But, come to think about it, it might be about the same as having a simplicity prior over , where each hypothesis is constrained to be invariant under the bridge transform (thanks to Proposition 2.8). So, maybe we can reformulate the framework to get rid of (but not of the bridge transform). Then again, finding the "ultimate prior" for general intelligence is a big open problem, and maybe in the end we will need to specify it with the help of .

Fourth, I wouldn't say that is supposed to solve the ontology identification problem. The way IBP solves the ontology identification problem is by asserting that is the correct ontology. And then there are tricks how to translate between other ontologies and this ontology (which is what section 3 is about).

$P \neq NP$ deserves a little more credit than you give it. To interpret the claim correctly, we need to notice that $P$ and $NP$ are classes of decision problems, not classes of proof systems for decision problems. You demonstrate that for a fixed proof system it is possible that generating proofs is easier than verifying proofs. However, if we fix a decision problem and allow any valid (i.e. sound and complete) proof system, then verifying cannot be harder than generating. Indeed, let $S$ be some proof system and $A$ an algorithm for generating proofs (i.e. an algorithm that finds a proof if a proof exists and outputs "nope" otherwise). Then, we can construct another proof system $S'$, in which a "proof" is just the empty string and "verifying" a proof for problem instance $x$ consists of running $A(x)$ and outputting "yes" if it found an $S$-proof and "no" otherwise. Hence, verification in $S'$ is no harder than generation in $S$. Now, so far it's just $P \subseteq NP$, which is trivial. The non-trivial part is: there exist problems for which verification is tractable (in some proof system) while generation is intractable (in any proof system). Arguably there are even many such problems (an informal claim).
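The construction above can be sketched in a few lines of Python. This is only a toy illustration under assumptions that are not in the comment itself: the decision problem is subset-sum, the first proof system takes a tuple of indices as its proof, and the names `verify_S`, `generate_S` and `verify_S_prime` are hypothetical labels for the original verifier, a brute-force proof generator, and the derived verifier that accepts only the empty proof.

```python
from itertools import combinations
from typing import Optional, Tuple

# Toy decision problem (an assumption for illustration): subset-sum.
# An instance is (numbers, target); the answer is "yes" if some subset sums to target.
Instance = Tuple[Tuple[int, ...], int]
Proof = Tuple[int, ...]  # in the first proof system, a proof is a tuple of indices


def verify_S(instance: Instance, proof: Proof) -> bool:
    """Verifier of the first proof system: check that the indices are valid,
    distinct, and select numbers summing to the target."""
    numbers, target = instance
    if len(set(proof)) != len(proof):
        return False
    if any(i < 0 or i >= len(numbers) for i in proof):
        return False
    return sum(numbers[i] for i in proof) == target


def generate_S(instance: Instance) -> Optional[Proof]:
    """Proof generator: brute-force search over all index subsets.
    Returns a proof if one exists, otherwise None (the "nope" case)."""
    numbers, _ = instance
    for size in range(len(numbers) + 1):
        for indices in combinations(range(len(numbers)), size):
            if verify_S(instance, indices):
                return indices
    return None


def verify_S_prime(instance: Instance, proof: str) -> bool:
    """Verifier of the derived proof system: the only admissible proof is the
    empty string, and verification just runs the generator."""
    return proof == "" and generate_S(instance) is not None


yes_instance: Instance = ((3, 5, 8), 13)
no_instance: Instance = ((3, 5, 8), 4)
print(generate_S(yes_instance), verify_S_prime(yes_instance, ""))
print(generate_S(no_instance), verify_S_prime(no_instance, ""))
```

The derived system is sound and complete whenever the original one is, and its verification cost equals the generation cost of the original. That is all the trivial direction needs; the sketch says nothing about the non-trivial direction.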

First, no, the AGI is not going to "employ complex heuristics to ever-better approximate optimal hypotheses update". The AGI is going to be based on an algorithm which, as a mathematical fact (if not proved then at least conjectured), converges to the correct hypothesis with high probability. Just like we can prove that e.g. SVMs converge to the optimal hypothesis in the respective class, or that particular RL algorithms for small MDPs converge to the correct hypothesis (assuming realizability).

Second, there's the issue of non-cartesian attacks ("hacking the computer"). Assuming that the core computing unit is not powerful enough to mount a non-cartesian attack on its own, such attacks can arguably be regarded as detrimental side-effects of running computations on the envelope. My hope is that we can shape the prior about such side-effects in some informed way (e.g. the vast majority of programs won't hack the computer) s.t. we still have approximate learnability (i.e. the system is not too afraid to run computations) without misspecification (i.e. the system is not overconfident about the safety of running computations). The more effort we put into hardening the system, the easier it should be to find such a sweet spot.

Third, I hope that the agreement solution will completely rule out any undesirable hypothesis, because we will have an actual theorem that guarantees it. What the exact assumptions will be, and what needs to be done to make sure these assumptions hold, is work for the future, ofc.

I don't think the argument on hacking relied on the ability to formally verify systems. Formally verified systems could potentially skew the balance of power to the defender side, but even if they don't exist, I don't think balance is completely skewed to the attacker.

My point was not about the defender/attacker balance. My point was that even short-term goals can be difficult to specify, which undermines the notion that we can easily empower ourselves by short-term AI.

Of course we need to understand how to define "long term" and "short term" here. One way to think about this is the following: we can define various short-term metrics, which are evaluable using information in the short-term, and potentially correlated with long-term success. We would say that a strategy is purely long-term if it cannot be explained by making advances on any combination of these metrics.

Sort of. The correct way to make it more rigorous, IMO, is using tools from algorithmic information theory, like I suggested here.

Thanks for the responses Boaz!

Our claim is that one can separate out components - there is the predictable component which is non stationary, but is best approximated with a relatively simple baseline, and the chaotic component, which over the long run is just noise. In general, highly complex rules are more sensitive to noise (in fact, there are theorems along these lines in the field of Analysis of Boolean Functions), and so in the long run, the simpler component will dominate the accuracy.

I will look into analysis of boolean functions, thank you. However, unless you want to make your claim more rigorous, it seems suspect to me.

In reality, there are processes happening simultaneously on many different timescales, from the microscopic to the cosmological. And, these processes are coupled, so that the current equilibrium of each process can be regarded as a control signal for the higher timescale processes. This means we can do long-term planning by starting from the long timescales and back-chaining to short timescales, like I began to formalize here.

So, while eventually the entire universe reaches an equilibrium state (a.k.a. heat-death), there is plenty of room for long-term planning before that.

Hacking is actually a fairly well-specified endeavor. People catalog, score, and classify security vulnerabilities. To hack would be to come up with a security vulnerability, and exploit code, which can be verified.

Yeeees, it does seem like hacking is an especially bad example. But even in this example, my position is quite defensible. Yes, theoretically you can formally specify the desired behavior of the code and verify that it always happens. But, there are two problems with that: First, for many realistic software systems, the formal specification would require colossal effort. Second, the formal verification is only as good as the formal model. For example, if the attacker found a hardware exploit, while your model assumes idealized behavior for the hardware, the verification doesn't help. And, in domains outside software the situation is much worse: how do you "verify" that your biological security measures are fool-proof, for example?

Also, you seem to be envisioning a long-term AI that is then fine-tuned on a short-term task, but how did it evolve these long-term goals in the first place?

When you're selecting for success on a short-term goal you might inadvertently produce a long-term agent (which, on the training distribution, is viewing the short-term goal as instrumental for its own goals), just like how evolution was selecting for genetic fitness but ended up producing agents with many preferences unrelated to that. More speculatively, there might be systematic reasons for such agents to arise, for example if good performance in the real-world requires physicalist epistemology which comes with inherent "long-terminess".

I would not say that there is no such thing as talent in being a CEO or president. I do however believe that the best leaders have been some combination of their particular characteristics and talents, and the situation they were in. Steve Jobs has led Apple to become the largest company in the world, but it is not clear that he is a "universal CEO" that would have done as well in any company (indeed he failed with NeXT).

This sounds like a story you can tell about anything. "Yes, such-and-such mathematician proved a really brilliant theorem A, but their effort to make progress in B didn't amount to much." Obviously, real-world performance depends on circumstances and not only on talent. This is doubly true in a competitive setting, where other similarly talented people are working against you. Nevertheless, a sufficiently large gap in talent can produce very lopsided outcomes.

Also, as Yafah points out elsewhere here, for people to actually trust an AI with being the leader of a company or a country, it would need to not just be as good as humans or a little better, but better by a huge margin. In fact, most people's initial suspicion is that AIs (or even humans that don't look like them) are not "aligned" with their interests, and if you don't convince them otherwise, their default would be to keep them from positions of power.

First, it is entirely possible the AI will be better by a huge margin, because like with most things, there's no reason to believe evolution brought us anywhere near the theoretical optimum on this. (Yes, there was selective pressure, but no amount of selective pressure allowed evolution to invent spaceships, or nuclear reactors, or even the wheel.) Second, what if the AI poses as a human? Or, what if the AI uses a human as a front while pulling the strings behind the scenes? There will be no lack of volunteers to work as such a front, if in the short term it brings them wealth and status. Also, ironically, the more successful AI risk skeptics are at swaying public opinion, the easier the AI's job is and the weaker their argument becomes.

The main point is that we need to measure the powers of a system as a whole, not compare the powers of an individual human with an individual AI. Clearly, if you took a human, made their memory capacity 10 times bigger, and made their speed 10 times faster, then they could do more things. But we are comparing with the case that humans will be assisted with short-term AIs that would help them in all of the tasks that are memory and speed intensive.

Alright, I can see how the "universality" argument makes sense if you believe that "human + short-term AI = scaled-up human". The part I doubt is that this equation holds for any easy-to-specify value of "short-term AI".

IIUC the thesis of this article rests on several interrelated claims:

  1. Long-term planning is not useful because of chaos
  2. Short-term AIs have no alignment problem
  3. Among humans, skill is not important for leadership, beyond some point
  4. Human brains have an advantage w.r.t. animals because of "universality", and any further advantage can only come from scaling with resources.

I wish to address these claims one by one.

Claim 1

This is an erroneous application of chaos theory IMO. The core observation of chaos theory is that in many dynamical systems with compact phase space, any distribution converges (in the Kantorovich-Rubinstein sense) to a unique stationary distribution. This means that small measurement errors lead to large prediction errors, and in the limit no information from the initial condition remains.
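To make the "small measurement errors lead to large prediction errors" part concrete, here is a minimal sketch. The choice of system is an assumption for illustration only: the logistic map x ↦ 4x(1−x) on the compact interval [0, 1], a standard textbook example of chaos. The starting point 0.3, the perturbation 1e-10 and the function names are likewise hypothetical.

```python
from typing import List


def logistic(x: float, r: float = 4.0) -> float:
    """One step of the logistic map; r = 4 is the fully chaotic regime."""
    return r * x * (1.0 - x)


def trajectory(x0: float, steps: int) -> List[float]:
    """Iterate the map `steps` times, returning all visited states (including x0)."""
    xs = [x0]
    for _ in range(steps):
        xs.append(logistic(xs[-1]))
    return xs


# Two initial conditions differing by a tiny "measurement error".
a = trajectory(0.3, 100)
b = trajectory(0.3 + 1e-10, 100)
gaps = [abs(x - y) for x, y in zip(a, b)]

# The gap can grow by at most a factor of 4 per step, so it is still tiny early on,
# but it typically reaches the size of the whole phase space within a few dozen steps.
for step in (0, 5, 20, 40, 60):
    print(step, gaps[step])
print("largest gap over the run:", max(gaps))
```

Once the gap has saturated, knowing the initial condition to ten decimal places gives no predictive advantage over sampling from the stationary distribution. That is the sense in which no information from the initial condition remains, and it relies on the phase space being compact.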

However, real-world dynamical systems are often not compact in the relevant approximation. In particular, acquisition of resources and development of new technologies are not bounded from above on a relevant scale. Indeed, trends in GDP growth and technological progress continue over long time scales and haven't converged, so far, to a stationary distribution. Ultimately, these quantities are also bounded for physical / information-theoretic / complexity-theoretic reasons, but since humanity is pretty far from saturating them, this leaves ample room for AI to have a long-term planning advantage over humanity.

Claim 2

Although it is true that, for sufficiently short-term planning horizons, AIs have fewer incentives to produce unintended consequences, problems remain.

One problem is that some tasks are very difficult to specify. For example, suppose that a group of humans armed with short-term AIs is engaged in cyberwarfare against a long-term AI. Then, even if every important step in the conflict can be modeled as short-term optimization, specifying the correct short-term goal can be a non-trivial task (how do you define "to hack" or "to prevent from hacking"?) that humans can't easily point their short-term AI towards.

Moreover, AIs trained on short-term objectives can still display long-term optimization out-of-distribution. This is because a long-term optimizer that is smart enough to distinguish between training and deployment can behave according to expectations during training while violating them as much as it wants when it's either outside of training or the correcting outer loop is too slow to matter.

Claim 3

This claim flies so much in the face of common sense (is there no such thing as business acumen? charisma? military genius?) that it needs a lot more supporting evidence IMO. The mere fact that IQs of e.g. CEOs are only moderately above average and not far above average only means that IQ stops being a useful metric at that range, since beyond some point, different people have cognitive advantages in different domains. I think that, as scientists, we need to be careful not to cavalierly dismiss the sort of skills we don't have.

As to the skepticism of the authors about social manipulation, I think that anyone who studied history or politics can attest that social manipulation has been used, and continues to be used, with enormous effects. (Btw, I think it's probably not that hard to separate a dog from a bone or a child from a toy if you're willing to e.g. be completely ruthless with intimidation.)

Claim 4

While it might be true that there is a sense in which human brains are "qualitatively optimal", this still leaves a lot of room for quantitative advantage, similar to how among two universal computers, one can be vastly more efficient than the other for practical purposes. As a more relevant analogy, we can think of two learning algorithms that learn the same class of hypotheses while still having a significant difference in computational and/or sample efficiency. In the limit of infinite resources and data, both algorithms converge to the same results, but in practice one still has a big advantage over the other. While undoubtedly there are hard limits to virtually every performance metric, there is no reason to believe evolution brought human brains anywhere near those limits. Furthermore, even if "scaling with resources" is the only thing that matters, the ability of AI to scale might be vastly better than the ability of humans to scale because of communication bandwidth bottlenecks between humans, not to mention the limited trust humans have towards one another (as opposed to large distributed AI systems, or disparate AI systems that can formally verify each other's trustworthiness).

Even if we did make a goal program, it's still unknown how to build an AGI that is motivated to compute it, or to follow the goals it outputs.

Actually, it is (to a 0th approximation) known how to build an AGI that is motivated to compute it: use infra-Bayesian physicalism. The loss function in IBP already has the semantics "which programs should run". Following the goal it outputs is also formalizable within IBP, but even without this step we can just have utopia inside the goal program itself[1].

  1. We should be careful to prevent the inhabitants of the virtual utopia from creating unaligned AI which eats the utopia. This sounds achievable, assuming the premise that we can actually construct such programs. ↩︎


I think that in your example, if a person is given a button that can save a person on a different planet from being tortured, they will have a direct incentive to press the button, because the button is a causal connection in itself, and consciously reasoning about the person on the other planet is a causal[1] connection in the other direction. That said, a person still has a limited budget of such causal connections (you cannot reason about a group of arbitrarily many people, paying a fixed non-zero amount of attention to the individual details of every person, in a fixed time-frame). Therefore, while the incentive is positive, its magnitude saturates as the number of saved people grows s.t. e.g. a button that saves a million people is virtually the same as a button that saves a billion people.

  1. I'm modeling this via Turing RL, where conscious reasoning can be regarded as a form of observation. Ofc this means we are talking about "logical" rather than "physical" causality. ↩︎

I'm curious what is the evidence you see that this is false as a description of the values of just about every human, given that

  • I, a human [citation needed] tell you that this seems to be a description of my values.
  • Almost every culture that ever existed had norms that prioritized helping family, friends and neighbors over helping random strangers, not to mention strangers that you never met.
  • Most people don't do much to help random strangers they never met, with the notable exception of effective altruists, but even most effective altruists only go so far[1].
  • Evolutionary psychology can fairly easily explain helping your family and tribe, but it seems hard to explain impartial altruism towards all humans.

  1. The common wisdom in EA is, you shouldn't donate 90% of your salary or deny yourself every luxury because if you live a fun life you will be more effective at helping others. However, this strikes me as suspiciously convenient and self-serving. ↩︎
