All of Vanessa Kosoy's Comments + Replies

Here is a way to construct many learnable undogmatic ontologies, including such with finite state spaces.

A deterministic partial environment (DPE) over action set  and observation set  is a pair  where  and  s.t.

  • If  is a prefix of some , then .
  • If  and  is a prefix of , then .

DPEs are equipped with a natural partial order. Namely,  when   and .

Let  ... (read more)

...the problem of how to choose one's IBH prior. (If the solution was something like "it's subjective/arbitrary" that would be pretty unsatisfying from my perspective.)


It seems clear to me that the prior is subjective. Like with Solomonoff induction, I expect there to exist something like the right asymptotic for the prior (i.e. an equivalence class of priors under the equivalence relation where  and  are equivalent when there exists some  s.t.  and ), but not a unique correct prior, just... (read more)

...I'm still comfortable sticking with "most are wide open".


Allow me to rephrase. The problems are open, that's fair enough. But, the gist of your post seems to be: "Since coming up with UDT, we ran into these problems, made no progress, and are apparently at a dead end. Therefore, UDT might have been the wrong turn entirely." On the other hand, my view is: Since coming up with those problems, we made a lot of progress on agent theory within the LTA, which has implications on those problems among other things, and so far this progress seems to only r... (read more)

4 Wei Dai 3mo
This is a bit stronger than how I would phrase it, but basically yes. I tend to be pretty skeptical of new ideas. (This backfired spectacularly once, when I didn't pay much attention to Satoshi when he contacted me about Bitcoin, but I think in general has served me well.) My experience with philosophical questions is that even when some approach looks a stone's throw away from a final solution to some problem, a bunch of new problems pop up and show that we're still quite far away. With an approach that is still as early as yours, I just think there's quite a good chance it doesn't work out in the end, or gets stuck somewhere on a hard problem. (Also some people who have dug into the details don't seem as optimistic that it is the right approach.) So I'm reluctant to decrease my probability of "UDT was a wrong turn" too much based on it. The rest of your discussion about 2TDT-1CDT seems plausible to me, although of course depends on whether the math works out, doing something about monotonicity, and also a solution to the problem of how to choose one's IBH prior. (If the solution was something like "it's subjective/arbitrary" that would be pretty unsatisfying from my perspective.)

I'll start with Problem 4 because that's the one where I feel closest to the solution. In your 3-player Prisoner's Dilemma, infra-Bayesian hagglers[1] (IBH agents) don't necessarily play CCC. Depending on their priors, they might converge to CCC or CCD or another Pareto-efficient outcome[2]. Naturally, if the first two agents have identical priors then e.g. DCC is impossible, but CCD is still possible. Whereas, if all 3 have the same prior they will necessarily converge to CCC. Moreover, there is no "best choice of prior": different choices do better in differen... (read more)

3 Wei Dai 3mo
I don't understand your ideas in detail (am interested but don't have the time/ability/inclination to dig into the mathematical details), but from the informal writeups/reviews/critiques I've seen of your overall approach, as well as my sense from reading this comment of how far away you are from a full solution to the problems I listed in the OP, I'm still comfortable sticking with "most are wide open". :) On the object level, maybe we can just focus on Problem 4 for now. What do you think actually happens in a 2IBH-1CDT game? Presumably CDT still plays D, and what do the IBH agents do? And how does that imply that the puzzle is resolved? As a reminder, the puzzle I see is that this problem shows that a CDT agent doesn't necessarily want to become more UDT-like, and for seemingly good reason, so on what basis can we say that UDT is a clear advancement in decision theory? If CDT agents similarly don't want to become more IBH-like, isn't there the same puzzle? (Or do they?) This seems different from the playing chicken with a rock example, because a rock is not a decision theory so that example doesn't seem to offer the same puzzle. ETA: Oh, I think you're saying that the CDT agent could turn into an IBH agent but with a different prior from the other IBH agents, that ends up allowing it to still play D while the other two still play C, so it's not made worse off by switching to IBH. Can you walk this through in more detail? How does the CDT agent choose what prior to use when switching to IBH, and how do the different priors actually imply a CCD outcome in the end?

The way I see it, all of these problems are reducible to (i) understanding what's up with the monotonicity principle in infra-Bayesian physicalism and (ii) completing a new and yet unpublished research direction (working title: "infra-Bayesian haggling") which shows that IB agents converge to Pareto efficient outcomes[1]. So, I wouldn't call them "wide open".

  1. ^

    Sometimes, but there are assumptions, see child comment for more details.

4 Wei Dai 3mo
Even items 1, 3, 4, and 6 are covered by your research agenda? If so, can you quickly sketch what you expect the solutions to look like?

First, I think that the theory of agents is a more useful starting point than metaphilosophy. Once we have a theory of agents, we can build models, within that theory, of agents reasoning about philosophical questions. Such models would be answers to special cases of metaphilosophy. I'm not sure we're going to have a coherent theory of "metaphilosophy" in general, distinct from the theory of agents, because I'm not sure that "philosophy" is an especially natural category[1].

Some examples of what that might look like:

  • An agent inventing a theory of agents in
... (read more)

Here is the sketch of a simplified model for how a metacognitive agent deals with traps.

Consider some (unlearnable) prior ζ over environments, s.t. we can efficiently compute the distribution ζ(h) over observations given any history h. For example, any prior over a small set of MDP hypotheses would qualify. Now, for each h, we regard ζ(h) as a "program" that the agent can execute and form beliefs about. In particular, we have a "metaprior" ξ consisting of metahypotheses: hypotheses-about-programs.

For ... (read more)

2 Vanessa Kosoy 4mo
Recording of a talk I gave in VAISU 2023.
2 Vanessa Kosoy 4mo
Here is the sketch of a simplified model for how a metacognitive agent deals with traps.

Consider some (unlearnable) prior ζ over environments, s.t. we can efficiently compute the distribution ζ(h) over observations given any history h. For example, any prior over a small set of MDP hypotheses would qualify. Now, for each h, we regard ζ(h) as a "program" that the agent can execute and form beliefs about. In particular, we have a "metaprior" ξ consisting of metahypotheses: hypotheses-about-programs.

For example, if we let every metahypothesis be a small infra-RDP satisfying appropriate assumptions, we probably have an efficient "metalearning" algorithm. More generally, we can allow a metahypothesis to be a learnable mixture of infra-RDPs: for instance, there is a finite state machine for specifying "safe" actions, and the infra-RDPs in the mixture guarantee no long-term loss upon taking safe actions.

In this setting, there are two levels of learning algorithms:

  • The metalearning algorithm, which learns the correct infra-RDP mixture. The flavor of this algorithm is RL in a setting where we have a simulator of the environment (since we can evaluate ζ(h) for any h). In particular, here we don't worry about exploitation/exploration tradeoffs.
  • The "metacontrol" algorithm, which given an infra-RDP mixture, approximates the optimal policy. The flavor of this algorithm is "standard" RL with exploitation/exploration tradeoffs.

In the simplest toy model, we can imagine that metalearning happens entirely in advance of actual interaction with the environment. More realistically, the two need to happen in parallel. It is then natural to apply metalearning to the current environmental posterior rather than the prior (i.e. the histories starting from the history that already occurred). Such an agent satisfies "opportunistic" guarantees: if at any point of time, the posterior admits a useful metahypothesis, the agent can exploit this metahypothesis. Thus, we address both

Jobst Heitzig asked me whether infra-Bayesianism has something to say about the absent-minded driver (AMD) problem. Good question! Here is what I wrote in response:

Philosophically, I believe that it is only meaningful to talk about a decision problem when there is also some mechanism for learning the rules of the decision problem. In ordinary Newcombian problems, you can achieve this by e.g. making the problem iterated. In AMD, iteration doesn't really help because the driver doesn't remember anything that happened before. We can consider a version of iter

... (read more)

Physicalist agents see themselves as inhabiting an unprivileged position within the universe. However, it's unclear whether humans should be regarded as such agents. Indeed, monotonicity is highly counterintuitive for humans. Moreover, historically human civilization struggled a lot with accepting the Copernican principle (and is still confused about issues such as free will, anthropics and quantum physics which physicalist agents shouldn't be confused about). This presents a problem for superimitation.

What if humans are actually cartesian agents? Then, it... (read more)

Until now I believed that a straightforward bounded version of the Solomonoff prior cannot be the frugal universal prior because Bayesian inference under such a prior is NP-hard. One reason it is NP-hard is the existence of pseudorandom generators. Indeed, Bayesian inference under such a prior distinguishes between a pseudorandom and a truly random sequence, whereas a polynomial-time algorithm cannot distinguish between them. It also seems plausible that, in some sense, this is the only obstacle: it was established that if one-way functions don't exist (wh... (read more)

I have a question about the conjecture at the end of Direction 17.5. Let  be a utility function with values in  and let  be a strictly monotonous function. Then  and  have the same maxima.  can be non-linear, e.g. . Therefore, I wonder if the condition  should be weaker.

No, because it changes the expected value of the utility function under various distributions.
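A small numerical sketch of this reply may help. The three outcomes, their utility values, and the choice f(x) = x² on [0, 1] are illustrative assumptions (the symbols in the question above were lost in extraction), so this only shows the general phenomenon: a strictly increasing transformation keeps the maximizer over deterministic outcomes but can flip the ranking of lotteries.

```python
# Utilities of three deterministic outcomes (illustrative values in [0, 1]).
U = {"a": 0.0, "b": 0.5, "c": 1.0}

def f(x):
    # An assumed strictly increasing, non-linear transformation on [0, 1].
    return x ** 2

def expected(utility, lottery):
    # Expected utility of a lottery given as {outcome: probability}.
    return sum(p * utility[o] for o, p in lottery.items())

fU = {o: f(u) for o, u in U.items()}

# The best deterministic outcome is the same for U and f(U).
best_U = max(U, key=U.get)
best_fU = max(fU, key=fU.get)

# The ranking of two lotteries, however, can change.
sure_b = {"b": 1.0}
gamble = {"a": 0.6, "c": 0.4}

prefers_sure_under_U = expected(U, sure_b) > expected(U, gamble)      # 0.5 vs 0.4
prefers_sure_under_fU = expected(fU, sure_b) > expected(fU, gamble)   # 0.25 vs 0.4
```

Here `prefers_sure_under_U` and `prefers_sure_under_fU` disagree even though `best_U` and `best_fU` coincide, which is why agreement of maxima alone is too weak a condition once expectations over distributions matter.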

Moreover, I ask myself if it is possible to modify  by a smal

... (read more)

Oops. What if instead of "for any " we go with "there exists "?

4 Scott Garrabrant 8mo
Then it is equivalent to the thing I call B2 in edit 2 in the post (Assuming A1-A3). In this case, your modified B2 is my B2, and your B3 is my A4, which follows from A5 assuming A1-A3 and B2, so your suspicion that these imply C4 is stronger than my Q6, which is false, as I argue here. However, without A5, it is actually much easier to see that this doesn't work. The counterexample here satisfies my A1-A3, your weaker version of B2, your B3, and violates C4.

Here’s a plausible human circular preference. You won a prize! Your three options are: (A) 5 lovely plates, (B) 5 lovely plates and 10 ugly plates, (C) 5 OK plates.

No one has done this exact experiment to my knowledge, but plausibly (based on discussion of a similar situation in Thinking Fast And Slow chapter 15) this is a circular preference in at least some people: When people see just A & B, they'll pick B because "it's more stuff, I can always keep the ugly ones as spares or use them for target practice or whatever". When they see just B & C, t

... (read more)
2 Steve Byrnes 8mo
Well, I guess it wouldn't be a circular preference for you. :) I think it wouldn't occur to many people that they could do one thing with the better 5 plates, and do a different thing with the worse 10 plates, if the plates are not presented in a way that makes the 5+10 division salient. Imagine the better and worse ones are all mixed up, and they're all the same design, such that they're obviously meant to be used as a set, but 2/3rds of the plates in the set have obvious cracks and chips. My impression (again see related experiments in the book chapter) is that many people would just take in the set of 15 plates as a whole and say "man, we can't eat off these, someone could get a cut, the sauce would leak onto the table etc.". The person would have to be kinda thinking outside the box and putting in some effort to notice that there are 5 plates in the set with no chips or cracks, and think of the strategy where they use those and throw out the other 10.

I propose the axioms A1-A3 together with

B2. If  then for any  we have 
B3. If  and , then for any  we have 

I suspect that these imply C4.

2 Scott Garrabrant 8mo
Your B3 is equivalent to A4 (assuming A1-3).
6 Scott Garrabrant 8mo
Your B2 is going to rule out a bunch of concave functions. I was hoping to only use axioms consistent with all (continuous) concave functions.

Maybe I am confused by what you mean by S. I thought it was the state space, but that isn't consistent with r in your post which was defined over A×O→Q?

I'm not entirely sure what you mean by the state space.  is a state space associated specifically with the utility function. It has nothing to do with the state space of the environment. The reward function in the OP is , not . I slightly abused notation by defining  in the parent comment. Let's say it's  and  is... (read more)

I think that step 6 is supposed to say "from 5 and 3" instead of "from 4 and 1"?

2 Abram Demski 8mo
Thanks, fixing!

Good idea!

Example 1

Fix some alphabet . Here's how you make an automaton that checks that the input sequence (an element of ) is a subsequence of some infinite periodic sequence with period . For every in , let be an automaton that checks whether the symbols in the input sequences at places s.t. are all equal (its number of states is ). We can modify it to make a transducer that produces its unmodified input sequence if the test passes and if the test fails. It also produces when the input is . We then chain ... (read more)
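The core test in this example can be sketched directly in code. This is not the automaton or transducer chain described above (whose symbols were lost in extraction); it is a plain check of the same condition, with the names `n` for the period and `k` for the position being assumptions: all symbols at positions in the same residue class mod `n` must be equal.

```python
def fits_period(seq, n):
    """Return True if positions that are congruent mod n always carry the same symbol.

    This holds exactly when seq can be read as a contiguous segment of some
    infinite sequence with period n.
    """
    seen = {}  # residue class -> first symbol observed at that residue
    for k, symbol in enumerate(seq):
        residue = k % n
        # setdefault stores the first symbol for this residue and returns it later.
        if seen.setdefault(residue, symbol) != symbol:
            return False
    return True
```

For instance, `fits_period("abcabca", 3)` holds while `fits_period("abcabd", 3)` does not, because positions 2 and 5 disagree. The dictionary plays the role of the per-residue automata: each entry remembers one symbol, which is why the state count of each component depends only on the alphabet size.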

The problem is that any useful prior must be based on Occam's razor, and Occam's razor + first-person POV creates the same problems as with the universal prior. And deliberately filtering out simulation hypotheses seems quite difficult, because it's unclear how to specify it. See also this.

1 Thane Ruthenis 9mo
Aha, that's the difficulty I was overlooking. Specifically, I didn't consider that the approach under consideration here requires us to formally define how we're filtering them out. Thanks!

This is not a typo.

I'm imagining that we have a program that outputs (i) a time discount parameter , (ii) a circuit for the transition kernel of an automaton and (iii) a circuit for a reward function (and, ii+iii are allowed to have a shared component to save computation time complexity). The utility function is defined by

where is defined recursively by

Okay, I think this makes sense. The idea is trying to re-interpret the various functions in the utility function as a single function and asking about the notion of complexity on that function which combines the complexity of producing a circuit which computes that function and the complexity of the circuit itself. But just to check: is T over S×A×O→S? I thought T in utility functions only depended on states and actions S×A→S? Maybe I am confused by what you mean by S. I thought it was the state space, but that isn't consistent with r in your post which was defined over A×O→Q? As a follow up: defining r as depending on actions and observations instead of actions and states (which is e.g. the definition in POMDP on Wikipedia) seems like it changes things. So I'm not sure if you intended the rewards to correspond with the observations or 'underlying' states. One more question, this one about the priors: what are they a prior over exactly? I will use the letters/terms from to try to be explicit. Is the prior capturing the "set of conditional observation probabilities" (O on Wikipedia)? Or is it capturing the "set of conditional transition probabilities between states" (T on Wikipedia)? Or is it capturing a distribution over all possible T and O? Or are you imagining that T is defined with U (and is non-random) and O is defined within the prior? I ask because the term DKL(ζ0||ζ) will be positive infinity if ζ is zero for any value where ζ0 is non-zero. Which makes the interpretation that it is either O or T directly pretty strange (for example, in the case where there are two states s1 and s2 and two observations o1 and o2, an O where P(si|oi)=1 and P(si|oj)=0 if i≠j would have a KL divergence of infinity from the ζ0 if ζ0 had non-zero probability on P(s1|o2)). So, I assume this is a prior over what the conditional observation matrices might be. I am assuming that your comment above implies tha
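The remark about the KL term can be checked numerically. The sketch below uses the standard definition of KL divergence for discrete distributions; the specific two-point distributions are made-up illustrations, not anything from the post. It shows that DKL(ζ0||ζ) is infinite as soon as ζ assigns zero probability somewhere ζ0 does not.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as equal-length lists."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # the term 0 * log(0 / q) is taken to be 0
        if qi == 0.0:
            return math.inf  # p puts mass on a point that q rules out
        total += pi * math.log(pi / qi)
    return total

zeta0 = [0.5, 0.5]            # reference distribution with full support
zeta_full = [0.9, 0.1]        # skewed, but nowhere zero
zeta_degenerate = [1.0, 0.0]  # rules out the second point entirely

finite_case = kl_divergence(zeta0, zeta_full)
infinite_case = kl_divergence(zeta0, zeta_degenerate)
```

Note the asymmetry: `kl_divergence(zeta_degenerate, zeta0)` is finite (it equals log 2), so only the direction in which ζ has smaller support than ζ0 blows up.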

We received no submissions so far, but I think that such submissions will appear here in the "prize claims" section.

For the contrived reward function you suggested, we would never have . But for other reward functions, it is possible that . Which is exactly why this framework rejects the contrived reward function in favor of those other reward functions. And also why this framework considers some policies unintelligent (despite the availability of the contrived reward function) and other policies intelligent.

Up to light editing, the following was written by me during the "Finding the Right Abstractions for healthy systems" research workshop, hosted by Topos Institute in January 2023. However, I invented the idea before.

In order to allow (the set of programs) to be infinite in IBP, we need to define the bridge transform for infinite . At first, it might seem can be allowed to be any compact Polish space, and the bridge transform should only depend on the topology on , but that runs into problems. Instead, the right structure on for defining the bridge t... (read more)

The following was written by me during the "Finding the Right Abstractions for healthy systems" research workshop, hosted by Topos Institute in January 2023. However, I invented the idea before.

Here's an elegant diagrammatic notation for constructing new infrakernels out of given infrakernels. There is probably some natural category-theoretic way to think about it, but at present I don't know what it is.

By “infrakernel” we will mean a continuous mapping of the form , where and are compact Polish spaces and is the space of credal sets (i.e. close... (read more)

My framework discards such contrived reward functions because it penalizes for the complexity of the reward function. In the construction you describe, we have . This corresponds to (no/low intelligence). On the other hand, policies with (high intelligence) have the property that for the which "justifies" this . In other words, your "minimal" overhead is very large from my point of view: to be acceptable, the "overhead" should be substantially negative.

1 David Scott Krueger 10mo
I think the construction gives us $C(\pi) \leq C(U) + e$ for a small constant $e$ (representing the wrapper).  It seems like any compression you can apply to the reward function can be translated to the policy via the wrapper.  So then you would never have $C(\pi) >> C(U)$.  What am I missing/misunderstanding?

The post is still largely up-to-date. In the intervening year, I mostly worked on the theory of regret bounds for infra-Bayesian bandits, and haven't made much progress on open problems in infra-Bayesian physicalism. On the other hand, I also haven't found any new problems with the framework.

The strongest objection to this formalism is the apparent contradiction between the monotonicity principle and the sort of preferences humans have. While my thinking about this problem evolved a little, I am still at a spot where every solution I know requires biting a... (read more)

First, the notation makes no sense. The prior is over hypotheses, each of which is an element of . is the notation used to denote a single hypothesis.

Second, having a prior just over doesn't work since both the loss function and the counterfactuals depend on .

Third, the reason we don't just start with a prior over , is because it's important which prior we have. Arguably, the correct prior is the image of a simplicity prior over physicalist hypotheses by the bridge transform. But, come to think about it, it might be about the sa... (read more)

deserves a little more credit than you give it. To interpret the claim correctly, we need to notice and are classes of decision problems, not classes of proof systems for decision problems. You demonstrate that for a fixed proof system it is possible that generating proofs is easier than verifying proofs. However, if we fix a decision problem and allow any valid (i.e. sound and complete) proof system, then verifying cannot be harder than generating. Indeed, let be some proof system and an algorithm for generating proofs (i.e. an algorithm t... (read more)

First, no, the AGI is not going to "employ complex heuristics to ever-better approximate optimal hypotheses update". The AGI is going to be based on an algorithm which, as a mathematical fact (if not proved then at least conjectured), converges to the correct hypothesis with high probability. Just like we can prove that e.g. SVMs converge to the optimal hypothesis in the respective class, or that particular RL algorithms for small MDPs converge to the correct hypothesis (assuming realizability).

Second, there's the issue of non-cartesian attacks ("hacking t... (read more)

I don't think the argument on hacking relied on the ability to formally verify systems. Formally verified systems could potentially skew the balance of power to the defender side, but even if they don't exist, I don't think balance is completely skewed to the attacker.

My point was not about the defender/attacker balance. My point was that even short-term goals can be difficult to specify, which undermines the notion that we can easily empower ourselves by short-term AI.

Of course we need to understand how to define "long term" and "short term" here. O

... (read more)

Thanks for the responses Boaz!

Our claim is that one can separate out components - there is the predictable component which is non-stationary, but is best approximated with a relatively simple baseline, and the chaotic component, which over the long run is just noise. In general, highly complex rules are more sensitive to noise (in fact, there are theorems along these lines in the field of Analysis of Boolean Functions), and so in the long run, the simpler component will dominate the accuracy.
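One standard illustration of the quoted claim about noise sensitivity (not necessarily the theorems the quote has in mind) compares a very simple Boolean rule with a maximally "complex" one. The sketch below computes exactly how often a function's output survives independent bit flips; the choice of the dictator and parity functions, and of n = 7 and flip probability 0.1, are assumptions made for the example.

```python
from itertools import product

def agreement_probability(f, n, eps):
    """Exact P[f(x) == f(y)] where x is uniform on {0,1}^n and y is x with
    each bit flipped independently with probability eps."""
    total = 0.0
    for x in product((0, 1), repeat=n):
        for flips in product((0, 1), repeat=n):
            k = sum(flips)
            # Probability of this x and this particular flip pattern.
            weight = (eps ** k) * ((1 - eps) ** (n - k)) / 2 ** n
            y = tuple(xi ^ fi for xi, fi in zip(x, flips))
            if f(x) == f(y):
                total += weight
    return total

def dictator(x):
    # Simple rule: the output is just the first input bit.
    return x[0]

def parity(x):
    # Complex rule: the output depends on every input bit.
    return sum(x) % 2

n, eps = 7, 0.1
stable_simple = agreement_probability(dictator, n, eps)
stable_complex = agreement_probability(parity, n, eps)
```

The dictator's output survives with probability 1 - eps, while parity's survives with probability (1 + (1 - 2·eps)^n) / 2, which tends to 1/2 (pure noise) as n grows. Whether real-world predictors decompose this cleanly is exactly what the surrounding discussion disputes.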

I will look into analysis of boolean functions, thank you. How... (read more)

Hi Vanessa, Perhaps given my short-term preference, it's not surprising that I find it hard to track very deep comment threads, but let me just give a couple of short responses. I don't think the argument on hacking relied on the ability to formally verify systems. Formally verified systems could potentially skew the balance of power to the defender side, but even if they don't exist, I don't think balance is completely skewed to the attacker. You could imagine that, like today, there is a "cat and mouse" game, where both attackers and defenders try to find "zero day vulnerabilities" and exploit (in one case) or fix (in the other). I believe that in the world of powerful AI, this game would continue, with both sides having access to AI tools, which would empower both but not necessarily shift the balance to one or the other. I think the question of whether a long-term planning agent could emerge from short-term training is a very interesting technical question! Of course we need to understand how to define "long term" and "short term" here. One way to think about this is the following: we can define various short-term metrics, which are evaluable using information in the short-term, and potentially correlated with long-term success. We would say that a strategy is purely long-term if it cannot be explained by making advances on any combination of these metrics.

IIUC the thesis of this article rests on several interrelated claims:

  1. Long-term planning is not useful because of chaos
  2. Short-term AIs have no alignment problem
  3. Among humans, skill is not important for leadership, beyond some point
  4. Human brains have an advantage w.r.t. animals because of "universality", and any further advantage can only come from scaling with resources.

I wish to address these claims one by one.

Claim 1

This is an erroneous application of chaos theory IMO. The core observation of chaos theory is that in many dynamical systems with compa... (read more)

Hi Vanessa,

Let me try to respond (note the claim numbers below are not the same as in the essay, but rather as in Vanessa's comment):

Claim 1: Our claim is that one can separate out components - there is the predictable component which is non-stationary, but is best approximated with a relatively simple baseline, and the chaotic component, which over the long run is just noise. In general, highly complex rules are more sensitive to noise (in fact, there are theorems along these lines in the field of Analysis of Boolean Functions), and so in the long run, the... (read more)

Even if we did make a goal program, it's still unknown how to build an AGI that is motivated to compute it, or to follow the goals it outputs.

Actually, it is (to a 0th approximation) known how to build an AGI that is motivated to compute it: use infra-Bayesian physicalism. The loss function in IBP already has the semantics "which programs should run". Following the goal it outputs is also formalizable within IBP, but even without this step we can just have utopia inside the goal program itself[1].

  1. We should be careful to prevent the inhabitants of th

... (read more)


I think that in your example, if a person is given a button that can save a person on a different planet from being tortured, they will have a direct incentive to press the button, because the button is a causal connection in itself, and consciously reasoning about the person on the other planet is a causal[1] connection in the other direction. That said, a person still has a limited budget of such causal connections (you cannot reason about a group of arbitrarily many people, with fixed non-zero amount of paying attention to the individual details of ... (read more)

I'm curious what evidence you see that this is false as a description of the values of just about every human, given that

  • I, a human [citation needed] tell you that this seems to be a description of my values.
  • Almost every culture that ever existed had norms that prioritized helping family, friends and neighbors over helping random strangers, not to mention strangers that you never met.
  • Most people don't do much to help random strangers they never met, with the notable exception of effective altruists, but even most effective altruists only go that
... (read more)

First, you can consider preferences that are impartial but sublinear in the number of people. So, you can disagree with Nate's room analogy without the premise "stuff only matters if it adds to my own life and experiences".

Second, my preferences are indeed partial. But even that doesn't mean "stuff only matters if it adds to my own life and experiences". I do think that stuff only matters (to me) if it's in some sense causally connected to my life and experiences. More details here.

Third, I don't know what you mean by "good". The questions that I unders... (read more)

2 Rob Bensinger 1y
Yeah, I'm also talking about question 1. Seems obviously false as a description of my values (and, I'd guess, just about every human's). Consider the simple example of a universe that consists of two planets: mine, and another person's. We don't have spaceships, so we can't interact. I am not therefore indifferent to whether the other person is being horribly tortured for thousands of years. If I spontaneously consider the hypothetical, I will very strongly prefer that my neighbor not be tortured. If we add the claims that I can't affect it and can't ever know about it, I don't suddenly go "Oh, never mind, fuck that guy". Stuff that happens to other people is real, even if I don't interact with it.

and, i'd guess that one big universe is more than twice as Fun as two small universes, so even if there were no transaction costs it wouldn't be worth it. (humans can have more fun when there's two people in the same room, than one person each in two separate rooms.)

This sounds astronomically wrong to me. I think that my personal utility function gets close to saturation with a tiny fraction of the resources in universe-shard. Two people in one room is better than two people in separate rooms, yes. But, two rooms with trillion people each is virtually t... (read more)

But, two rooms with trillion people each is virtually the same as one room with two trillion. The returns on interactions with additional people fall off exponentially past the Dunbar number.

You're conflating "would I enjoy interacting with X?" with "is it good for X to exist?". Which is almost understandable given that Nate used the "two people can have more fun in the same room" example to illustrate why utility isn't linear in population. But this comment has an IMO bizarre amount of agreekarma (26 net agreement, with 11 votes), which makes me wonder if... (read more)

There's also the ALTER prize for progress on the learning-theoretic agenda.

Yes, absolutely! The contest is not a publication venue.

A major impediment in applying RL theory to any realistic scenario is that even the control problem[1] is intractable when the state space is exponentially large (in general). Real-life agents probably overcome this problem by exploiting some special properties of real-life environments. Here are two strong candidates for such properties:

  • In real life, processes can often be modeled as made of independent co-existing parts. For example, if I need to decide on my exercise routine for the next month and also on my research goals for the next month, the two
... (read more)
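One simple way the first property can help is sketched below, under strong assumptions that go beyond the text: the two parts have fully independent, deterministic dynamics, each has its own action, and the total reward is the sum of the parts' rewards. In that special case the optimal value of the joint problem is the sum of the parts' optimal values, so each part can be solved on its own small state space. All names and the toy "cycle" components are hypothetical.

```python
from itertools import product

def value_iteration(states, actions, step, reward, gamma, sweeps=200):
    """Discounted value iteration for a small deterministic MDP."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V = {s: max(reward(s, a) + gamma * V[step(s, a)] for a in actions)
             for s in states}
    return V

def make_cycle(size, goal):
    # Toy component: walk around a cycle, rewarded for arriving at `goal`.
    states = list(range(size))
    actions = ["stay", "advance"]
    step = lambda s, a: s if a == "stay" else (s + 1) % size
    reward = lambda s, a: 1.0 if step(s, a) == goal else 0.0
    return states, actions, step, reward

gamma = 0.9
A = make_cycle(3, goal=2)
B = make_cycle(4, goal=1)
V_A = value_iteration(*A, gamma)
V_B = value_iteration(*B, gamma)

# Joint process: product state space, one action per component, additive reward.
joint_states = list(product(A[0], B[0]))
joint_actions = list(product(A[1], B[1]))
joint_step = lambda s, a: (A[2](s[0], a[0]), B[2](s[1], a[1]))
joint_reward = lambda s, a: A[3](s[0], a[0]) + B[3](s[1], a[1])
V_joint = value_iteration(joint_states, joint_actions, joint_step, joint_reward, gamma)

# Largest difference between the joint value and the sum of component values.
max_gap = max(abs(V_joint[(sa, sb)] - (V_A[sa] + V_B[sb]))
              for sa, sb in joint_states)
```

The joint problem here has 3 × 4 = 12 states while the two parts have 3 + 4 = 7 in total; with many parts the gap becomes exponential. Realistic environments are only approximately decomposable, which is presumably where the hard part of the research question lies.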

A question that often comes up in discussion of IRL: are agency and values purely behavioral concepts, or do they depend on how the system produces its behavior? The cartesian measure of agency I proposed seems purely behavioral, since it only depends on the policy. The physicalist version seems less so since it depends on the source code, but this difference might be minor: this role of the source is merely telling the agent "where" it is in the universe. However, on closer examination, the physicalist is far from purely behaviorist, and this is true e... (read more)

The spectrum you're describing is related, I think, to the spectrum that appears in the AIT definition of agency where there is dependence on the cost of computational resources. This means that the same system can appear agentic from a resource-scarce perspective but non-agentic from a resource-abundant perspective. The former then corresponds to the Vingean regime and the latter to the predictable regime. However, the framework does have a notion of prior and not just utility, so it is possible to ascribe beliefs to Vingean agents. I think it makes sense... (read more)

Causality in IBP

There seems to be an even more elegant way to define causal relationships between agents, or more generally between programs. Starting from a hypothesis , for , we consider its bridge transform . Given some subset of programs we can define then project to [1]. We can then take bridge transform again to get some . The factor now tells us which programs causally affect the manifestation of programs in . Notice that by Proposition 2.8 in the IBP article, when we just get all pro... (read more)

1Martín Soto1y
Hi Vanessa! Thanks again for your previous answers. I've got one further concern. Are all mesa-optimizers really only acausal attackers? I think mesa-optimizers don't need to be purely contained in a hypothesis (rendering them acausal attackers), but can be made up of a part of the hypotheses-updating procedures (maybe this is obvious and you already considered it). Of course, since the only way to change the AGI's actions is by changing its hypotheses, even these mesa-optimizers will have to alter hypothesis selection. But their whole running program doesn't need to be captured inside any hypothesis (which would be easier for classifying acausal attackers away). That is, if we don't think about how the AGI updates its hypotheses, and just consider them magically updating (without any intermediate computations), then of course, the only mesa-optimizers will be inside hypotheses. If we actually think about these computations and consider a brute-force search over all hypotheses, then again they will only be found inside hypotheses, since the search algorithm itself is too simple and provides no further room for storing a subagent (even if the mesa-optimizer somehow takes advantage of the details of the search). But if more realistically our AGI employs more complex heuristics to ever-better approximate optimal hypotheses update, mesa-optimizers can be partially or completely encoded in those (put another way, those non-optimal methods can fail / be exploited). This failure could be seen as a capabilities failure (in the trivial sense that it failed to correctly approximate perfect search), but I think it's better understood as an alignment failure. The way I see PreDCA (and this might be where I'm wrong) is as an "outer top-level protocol" which we can fit around any superintelligence of arbitrary architecture. That is, the superintelligence will only have to carry out the hypotheses update (plus some trivial calculations over hypotheses to find the best

The problem of future unaligned AI leaking into human imitation is something I wrote about before. Notice that IDA-style recursion helps a lot, because instead of simulating a process going deep into the external timeline's future, you're simulating a "groundhog day" where the researcher wakes up over and over at the same external time (more realistically, the restart time is drifting forward with the time outside the simulation) with a written record of all their previous work (but no memory of it). There can still be a problem if there is a positive proba... (read more)

I think it's a terrible idea to automatically adopt an equilibrium notion which incentivises the players to come up with increasingly nasty threats as fallback if they don't get their way. And so there seems to be a good chunk of remaining work to be done, involving poking more carefully at the CoCo value and seeing which assumptions going into it can be broken.

I'm not convinced there is any real problem here. The intuitive negative reaction we have to this "ugliness" is because of (i) empathy and (ii) morality. Empathy is just a part of the utility fun... (read more)

This is a fascinating result, but there is a caveat worth noting. When we say that e.g. AlphaGo is "superhuman at go" we are comparing it to humans who (i) spent years training on the task and (ii) were selected for being the best at it among a sizable population. On the other hand, with next token prediction we're nowhere near that amount of optimization on the human side. (That said, I also agree that optimizing a model on next token prediction is very different from optimizing it for text coherence would be, if we could accomplish the latter.)

3Buck Shlegeris1y
Yeah, I agree that it would be kind of interesting to see how good humans would get at this if it was a competitive sport. My guess is still that the best humans would be worse than GPT-3, and I'm unsure if they're worse than GPT-2. (There's no limit on anyone spending a bunch of time practicing this game; if for some reason someone gets really into it, I'd enjoy hearing about the results.)

The short answer is, I don't know.

The long answer is, here are some possibilities, roughly ordered from "boring" to "weird":

  1. The framework is wrong.
  2. The framework is incomplete, there is some extension which gets rid of monotonicity. There are some obvious ways to make such extensions, but they look uglier and without further research it's hard to say whether they break important things or not.
  3. Humans are just not physicalist agents, you're not supposed to model them using this framework, even if this framework can be useful for AI. This is why humans too
... (read more)

The problem is that if implies that creates but you consider a counterfactual in which doesn't create then you get an inconsistent hypothesis i.e. a HUC which contains only 0. It is not clear what to do with that. In other words, the usual way of defining counterfactuals in IB (I tentatively named it "hard counterfactuals") only makes sense when the condition you're counterfactualizing on is something you have Knightian uncertainty about (which seems safe to assume if this condition is about your own future action but not safe to assume in genera... (read more)

it would be the best possible model of this type, at the task of language modeling on data sampled from the same distribution as MassiveText

Transformers are Turing complete, so "model of this type" is not much of a constraint. On the other hand, I guess it's theoretically possible that some weight matrices are inaccessible to current training algorithms no matter how much compute and data we have. It seems also possible that the scaling law doesn't go on forever, but phase-transitions somewhere (maybe very far) to a new trend which goes below the "irreducible" term.
