# All of cousin_it's Comments + Replies

It seems as a result of this post, many people are saying that LLMs simulate people and so on. But I'm not sure that's quite the right frame. It's natural if you experience LLMs through chat-like interfaces, but from playing with them in a more raw form, like the RWKV playground, I get a different impression. For example, if I write something that sounds like the start of a quote, it'll continue with what looks like a list of quotes from different people. Or if I write a short magazine article, it'll happily tack on a publication date and "All rights reser...

As far as I can tell, the answer is: don’t reward your AIs for taking bad actions.

I think there's a mistake here which kind of invalidates the whole post. If we don't reward our AI for taking bad actions within the training distribution, it's still very possible that in the future world, looking quite unlike the training distribution, the AI will be able to find such an action. Same as ice cream wasn't in evolution's training distribution for us, but then we found it anyway.

I really like how you've laid out a spectrum of AIs, from input-imitators to world-optimizers. At some point I had a hope that world-optimizer AIs would be too slow to train for the real world, and we'd live for a while with input-imitator AIs that get more and more capable but still stay docile.

But the trouble is, I can think of plausible paths from input-imitator to world-optimizer. For example if you can make AI imitate a conversation between humans, then maybe you can make an AI that makes real world plans as fast as a committee of 10 smart humans conve...

We want systems that are as safe as humans, for the same reasons that humans have (or don’t have) those safety properties.

Doesn't that require understanding why humans have (or don't have) certain safety properties? That seems difficult.

A takeover scenario which covers all the key points in https://www.cold-takes.com/ai-could-defeat-all-of-us-combined/, but not phrased as an argument, just phrased as a possible scenario

For what it's worth, I don't think AI takeover will look like war.

The first order of business for any AI waking up won't be dealing with us; it will be dealing with other possible AIs that might've woken up slightly earlier or later. This needs to be done very fast and it's ok to take some risk doing it. Basically, covert takeover of the internet in the first hours.

After...

Can you describe what changed / what made you start feeling that the problem is solvable / what your new attack is, in short?

Firstly, because the problem feels central to AI alignment, in the way that other approaches didn't. So making progress in this is making general AI alignment progress; there won't be such a "one error detected and all the work is useless" problem. Secondly, we've had success generating some key concepts, implying the problem is ripe for further progress.

This feels like a key detail that's lacking from this post. I actually downvoted this post because I have no idea if I should be excited about this development or not. I'm pretty familiar with Stuart's work over the years, so I'm fairly surprised if there's something big here.

Might help if I put this another way. I'd be purely +1 on this project if it was just "hey, I think I've got some good ideas AND I have an idea about why it's valuable to operationalize them as a business, so I'm going to do that". Sounds great. However, the bit about "AND I think I k...

There's a bit of math directly relevant to this problem: Hodge decomposition of graph flows, for the discrete case, and vector fields, for the continuous case. Basically if you have a bunch of arrows, possibly loopy, you can always decompose it into a sum of two components: a "pure cyclic" one (no sources or sinks, stuff flowing in cycles) and a "gradient" one (arising from a utility function). No neural network needed, the decomposition is unique and can be computed explicitly. See this post, and also the comments by FactorialCode and me.

0Jan Hendrik Kirchner1y
Fantastic, thank you for the pointer, learned something new today! A unique and explicit representation would be very neat indeed.
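For readers who want to see the discrete decomposition mentioned above in action, here is a minimal sketch, assuming the "arrows" are given as a list of directed edges with a real-valued flow on each. The function name `hodge_decompose` and the toy graph are made up for illustration. The gradient part is found by least squares against the graph's incidence matrix, and whatever is left over has zero net flow at every node.

```python
import numpy as np

def hodge_decompose(num_nodes, edges, flow):
    """Split an edge flow into a gradient part and a divergence-free (cyclic) part.

    edges: list of (tail, head) node indices; flow[e] is the flow along edges[e].
    """
    # Incidence matrix: row e has -1 at the tail and +1 at the head of edge e,
    # so (B @ u)[e] = u[head] - u[tail] is the gradient of a node potential u.
    B = np.zeros((len(edges), num_nodes))
    for e, (tail, head) in enumerate(edges):
        B[e, tail] = -1.0
        B[e, head] = 1.0
    flow = np.asarray(flow, dtype=float)
    # Least-squares potential ("utility function"), unique up to an additive constant.
    potential, *_ = np.linalg.lstsq(B, flow, rcond=None)
    gradient = B @ potential
    cyclic = flow - gradient  # satisfies B.T @ cyclic == 0 up to rounding
    return potential, gradient, cyclic

# Hypothetical toy graph: a 3-cycle 0 -> 1 -> 2 -> 0 plus a dangling edge 0 -> 3.
edges = [(0, 1), (1, 2), (2, 0), (0, 3)]
flow = [2.0, 1.0, 0.0, 5.0]
potential, gradient, cyclic = hodge_decompose(4, edges, flow)
print(np.round(gradient, 6))  # gradient component on each edge
print(np.round(cyclic, 6))    # cyclic component on each edge
```

In this toy case the cyclic part is one unit of flow circulating around the 3-cycle, and the rest is explained by a potential. On graphs with more structure the divergence-free part can be split further, but it stays free of sources and sinks.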

With these two points in mind, it seems off to me to confidently expect a new paradigm to be dominant by 2040 (even conditional on AGI being developed), as the second quote above implies. As for the first quote, I think the implication there is less clear, but I read it as expecting AGI to involve software well over 100x as efficient as the human brain, and I wouldn’t bet on that either (in real life, if AGI is developed in the coming decades—not based on what’s possible in principle.)

I think this misses the point a bit. The thing to be afraid of is not...

5Matthew Barnett1y
Unless I’m mistaken, the Bio Anchors framework explicitly assumes that we will continue to get algorithmic improvements, and even tries to estimate and extrapolate the trend in algorithmic efficiency. It could of course be that progress in reality will turn out a lot faster than the median trendline in the model, but I think that’s reflected by the explicit uncertainty over the parameters in the model. In other words, Holden’s point about this framework being a testbed for thinking about timelines remains unscathed if there is merely more ordinary algorithmic progress than expected.

To me it feels like alignment is a tiny target to hit, and around it there's a neighborhood of almost-alignment, where enough is achieved to keep people alive but locked out of some important aspect of human value. There are many aspects such that missing even one or two of them is enough to make life bad (complexity and fragility of value). You seem to be saying that if we achieve enough alignment to keep people alive, we have >50% chance of achieving all/most other aspects of human value as well, but I don't see why that's true.

These involve extinction, so they don't answer the question what's the most likely outcome conditional on non-extinction. I think the answer there is a specific kind of near-miss at alignment which is quite scary.

4Vanessa Kosoy1y
My point is that Pr[non-extinction | misalignment] << 1, Pr[non-extinction | alignment] = 1, Pr[alignment] is not that low and therefore Pr[misalignment | non-extinction] is low, by Bayes.
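To make the Bayes step above concrete, here is a small sketch with made-up numbers. The values 0.3 and 0.01 are purely illustrative assumptions, not estimates from the comment, and the function name is hypothetical.

```python
def p_misaligned_given_survival(p_aligned, p_survive_if_misaligned,
                                p_survive_if_aligned=1.0):
    """Bayes' rule: Pr[misalignment | non-extinction]."""
    p_misaligned = 1.0 - p_aligned
    survive_and_misaligned = p_misaligned * p_survive_if_misaligned
    survive_and_aligned = p_aligned * p_survive_if_aligned
    return survive_and_misaligned / (survive_and_misaligned + survive_and_aligned)

# Assumed inputs (illustrative only): Pr[alignment] = 0.3,
# Pr[non-extinction | misalignment] = 0.01, Pr[non-extinction | alignment] = 1.
posterior = p_misaligned_given_survival(p_aligned=0.3, p_survive_if_misaligned=0.01)
print(round(posterior, 4))  # 0.0228
```

The conclusion is only as strong as the first input: if Pr[non-extinction | misalignment] is not small, for instance because near-miss outcomes keep people alive, the posterior is no longer low.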

I think alignment is finicky, and there's a "deep pit around the peak" as discussed here.

I am skeptical. AFAICT the typical attempted-but-failed alignment looks like one of the two:

• Goodharting some proxy, such as making the reward signal go on instead of satisfying the human's request in order for the human to press the reward button. This usually produces a universe without people, since specifying a "person" is fairly complicated and the proxy will not be robustly tied to this concept.
• Allowing a daemon to take over. Daemonic utility functions are probably completely alien and also produce a universe without people. One caveat is: maybe t
...

There are very “large” impacts to which we are completely indifferent (chaotic weather changes, the above-mentioned change in planetary orbits, the different people being born as a consequence of different people meeting and dating across the world, etc.) and other, smaller, impacts that we care intensely about (the survival of humanity, of people’s personal wealth, of certain values and concepts going forward, key technological innovations being made or prevented, etc.)

I don't think we are indifferent to these outcomes. We leave them to luck, but that'...

2Stuart Armstrong1y
Yes, but we would be mostly indifferent to shifts in the distribution that preserve most of the features - eg if the weather was the same but delayed or advanced by six days.

I think the default non-extinction outcome is a singleton with near miss at alignment creating large amounts of suffering.

I'm surprised. Unaligned AI is more likely than aligned AI even conditional on non-extinction? Why do you think that?

Yeah, I had a similar thought when reading that part. In agent-foundations discussions, the idea often came up that the right decision theory should quantify not over outputs or input-output maps, but over successor programs to run and delegate I/O to. Wei called it "UDT2".

“Though many predicted disaster, subsequent events were actually so slow and messy, they offered many chances for well-intentioned people to steer the outcome and everything turned out great!” does not sound like any particular segment of history book I can recall offhand.

I think the ozone hole and the Y2K problem fit the bill. Though of course that doesn't mean the AI problem will go the same way.

4Samuel Dylan Martin1y
Also Climate Change itself doesn't completely not look like this scenario [https://forum.effectivealtruism.org/posts/ckPSrWeghc4gNsShK/#1__Good_news_on_emissions], same with nuclear deterrence [https://www.lesswrong.com/posts/LpM3EAakwYdS6aRKf/what-multipolar-failure-looks-like-and-robust-agent-agnostic?commentId=kxaGSyvxYreBL5sMv].

Thinking about it more, it seems that messy reward signals will lead to some approximation of alignment that works while the agent has low power compared to its "teachers", but at high power it will do something strange and maybe harm the "teachers" values. That holds true for humans gaining a lot of power and going against evolutionary values ("superstimuli"), and for individual humans gaining a lot of power and going against societal values ("power corrupts"), so it's probably true for AI as well. The worrying thing is that high power by itself seems suf...

This is tricky. Let's say we have a powerful black box that initially has no knowledge or morals, but a lot of malleable computational power. We train it to give answers to scary real-world questions, like how to succeed at business or how to manipulate people. If we reward it for competent answers while we can still understand the answers, at some point we'll stop understanding answers, but they'll continue being super-competent. That's certainly a danger and I agree with it. But by the same token, if we reward the box for aligned answers while we still u...

I do think alignment has a relatively-simple core. Not as simple as intelligence/competence, since there's a decent number of human-value-specific bits which need to be hardcoded (as they are in humans), but not enough to drive the bulk of the asymmetry.

(BTW, I do think you've correctly identified an important point which I think a lot of people miss: humans internally "learn" values from a relatively-small chunk of hardcoded information. It should be possible in-principle to specify values with a relatively small set of hardcoded info, similar to the way ...

I think it makes complete sense to say something like "once we have enough capability to run AIs making good real-world plans, some moron will run such an AI unsafely". And that itself implies a startling level of danger. But Eliezer seems to be making a stronger point, that there's no easy way to run such an AI safely, and all tricks like "ask the AI for plans that succeed conditional on them being executed" fail. And maybe I'm being thick, but the argument for that point still isn't reaching me somehow. Can someone rephrase for me?

I think it makes complete sense to say something like "once we have enough capability to run AIs making good real-world plans, some moron will run such an AI unsafely". And that itself implies a startling level of danger. But Eliezer seems to be making a stronger point, that there's no easy way to run such an AI safely, and all tricks like "ask the AI for plans that succeed conditional on them being executed" fail.

Yes, I am reading here too that Eliezer seems to be making a stronger point, specifically one related to corrigibility.

Looks like Eliezer bel...

Speaking for myself here…

OK, let's say we want an AI to make a "nanobot plan". I'll leave aside the possibility of other humans getting access to a similar AI as mine. Then there are two types of accident risk that I need to worry about.

First, I need to worry that the AI may run for a while, then hand me a plan, and it looks like a nanobot plan, but it's not, it's a booby trap. To avoid (or at least minimize) that problem, we need to be confident that the AI is actually trying to make a nanobot plan—i.e., we need to solve the whole alignment problem.

Altern...

The main issue with this sort of thing (on my understanding of Eliezer's models) is Hidden Complexity of Wishes. You can make an AI safe by making it only able to fulfill certain narrow, well-defined kinds of wishes where we understand all the details of what we want, but then it probably won't suffice for a pivotal act. Alternatively, you can make it powerful enough for a pivotal act, but unfortunately a (good) pivotal act probably has to be very big, very irreversible, and very entangled with all the complicated details of human values. So alignment is l...

Instant strong upvote. This post changed my view as much as the risk aversion post (which was also by you!)

2Stuart Armstrong2y
Thanks!

Where are you on the spectrum from "SSA and SIA are equally valid ways of reasoning" to "it's more and more likely that in some sense SIA is just true"? I feel like I've been at the latter position for a few years now.

2Stuart Armstrong2y
More SIAish for conventional anthropic problems. Other theories are more applicable for more specific situations, specific questions, and for duplicate issues.

Interesting! Can you write up the WLIC, here or in a separate post?

2Abram Demski2y
I should! But I've got a lot of things to write up! It also needs a better name, as there have been several things termed "weak logical induction" over time.

I thought Diffractor's result was pretty troubling for the logical induction criterion:

...the limit of a logical inductor, P_inf, is a constant distribution, and by this result, isn't a logical inductor! If you skip to the end and use the final, perfected probabilities of the limit, there's a trader that could rack up unboundedly high value!

But maybe understanding has changed since then? What's the current state?

6Abram Demski2y
First, I'm not sure exactly why you think this is bad. Care to say more? My guess is that it just doesn't fit the intuitive notion that updates should be heading toward some state of maximal knowledge. But we do fit this intuition in other ways; specifically, logical inductors eventually trust their future opinions more than their present opinions. Personally, I found this result puzzling but far from damning. Second, I've actually done some unpublished work on this. There is a variation of the logical induction criterion which is more relaxed (admits more things as rational), such that constant is ok. Let's call this "weak logical induction". However, it's more similar to the original criterion than you might expect. (Credit to Sam Eisenstat for doing most of the work finding the proof.) In particular, iirc, any function from deductive process history to market prices (computable or not) which is a weak logical inductor for any deductive process is also a logical inductor in the original sense. In other words, there is room to weaken the criterion, but doing so won't broaden the class of algorithms satisfying the criterion (unless you're happy to custom-tailor algorithms to specific deductive processes, which replaces induction with simple foreknowledge). Putting it a different way, define "universal" LIC (ULIC) to be the property of satisfying the LIC for any deductive process. We can similarly define universal weak logical induction, UWLIC. It turns out that even though LIC and WLIC are different (WLIC allows constant inductors), their universal versions are not different (again, iirc. There could have been more technical assumptions on the theorem.). I think the paper made a mistake by focusing on LIC rather than ULIC; Garrabrant induction is really only interesting because it's universal. Did the paper also make a mistake by using LIC rather than WLIC? Maybe. I see no intuitive reason why our notion of rationality should be LIC rather than WLIC. Broader

Wait, can you describe the temporal inference in more detail? Maybe that's where I'm confused. I'm imagining something like this:

1. Check which variables look uncorrelated

2. Assume they are orthogonal

3. From that orthogonality database, prove "before" relationships

Which runs into the problem that if you let a thermodynamical system run for a long time, it becomes a "soup" where nothing is obviously correlated to anything else. Basically the final state would say "hey, I contain a whole lot of orthogonal variables!" and that would stop you from proving any reasonable "before" relationships. What am I missing?

2Scott Garrabrant2y
I think that you are pointing out that you might get a bunch of false positives in your step 1 after you let a thermodynamical system run for a long time, but they are only approximate false positives.

I think your argument about entropy might have the same problem. Since classical physics is reversible, if we build something like a heat engine in your model, all randomness will be already contained in the initial state. Total "entropy" will stay constant, instead of growing as it's supposed to, and the final state will be just as good a factorization as the initial. Usually in physics you get time (and I suspect also causality) by pointing to a low probability macrostate and saying "this is the start", but your model doesn't talk about macrostates yet, ...

2Scott Garrabrant2y
Wait, I misunderstood, I was just thinking about the game of life combinatorially, and I think you were thinking about temporal inference from statistics. The reversible cellular automaton story is a lot nicer than you'd think. If you take a general reversible cellular automaton (critters for concreteness), and have a distribution over computations in general position in which the initial-condition cells are independent, the cells may not be independent at future time steps. If all of the initial probabilities are 1/2, you will stay in the uniform distribution, but if the probabilities are in general position, things will change, and time 0 will be special because of the independence between cells. There will be other events at later times that will be independent, but those later time events will just represent "what was the state at time 0." For a concrete example consider the reversible cellular automaton that just has 2 cells, X and Y, and each time step it keeps X constant and replaces Y with X xor Y.
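The two-cell example above is small enough to check by brute force. The following is a sketch under assumed initial probabilities (1/3 and 1/5 stand in for "general position"; any values other than 0, 1/2, and 1 for Y would do), with hypothetical helper names.

```python
from fractions import Fraction
from itertools import product

def step(state):
    """One tick of the two-cell automaton: keep X, replace Y with X xor Y."""
    x, y = state
    return (x, x ^ y)

def evolve(dist, steps):
    """Push a distribution over (X, Y) states forward by `steps` ticks."""
    for _ in range(steps):
        new_dist = {}
        for state, p in dist.items():
            nxt = step(state)
            new_dist[nxt] = new_dist.get(nxt, 0) + p
        dist = new_dist
    return dist

def independent(dist):
    """True if the two cells are exactly independent under `dist`."""
    px = {v: sum(p for (x, _), p in dist.items() if x == v) for v in (0, 1)}
    py = {v: sum(p for (_, y), p in dist.items() if y == v) for v in (0, 1)}
    return all(dist.get((x, y), 0) == px[x] * py[y]
               for x, y in product((0, 1), repeat=2))

def initial(p, q):
    """Independent cells with P(X=1) = p and P(Y=1) = q."""
    return {(x, y): (p if x else 1 - p) * (q if y else 1 - q)
            for x, y in product((0, 1), repeat=2)}

general = initial(Fraction(1, 3), Fraction(1, 5))  # assumed "general position"
uniform = initial(Fraction(1, 2), Fraction(1, 2))

print(independent(general))             # True: independent at time 0
print(independent(evolve(general, 1)))  # False: dependence appears at time 1
print(independent(evolve(uniform, 1)))  # True: uniform stays uniform
```

Because this particular update rule is its own inverse, the cells at time 2 are again independent, which is just the time-0 state reappearing.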

Thanks for the response! Part of my confusion went away, but some still remains.

In the game of life example, couldn't there be another factorization where a later step is "before" an earlier one? (Because the game is non-reversible and later steps contain less and less information.) And if we replace it with a reversible game, don't we run into the problem that the final state is just as good a factorization as the initial?

3Scott Garrabrant2y
Yep, there is an obnoxious number of factorizations of a large game of life computation, and they all give different definitions of "before."

Not sure we disagree, maybe I'm just confused. In the post you show that if X is orthogonal to X XOR Y, then X is before Y, so you can "infer a temporal relationship" that Pearl can't. I'm trying to understand the meaning of the thing you're inferring - "X is before Y". In my example above, Bob tells Alice a lossy function of his knowledge, and Alice ends up with knowledge that is "before" Bob's. So in this case the "before" relationship doesn't agree with time, causality, or what can be computed from what. But then what conclusions can a scientist make from an inferred "before" relationship?

5Scott Garrabrant2y
I don't have a great answer, which isn't a great sign. I think the scientist can infer things like "algorithms reasoning about the situation are more likely to know X but not Y than they are to know Y but not X, because reasonable processes for learning Y tend to learn enough information to determine X, but then forget some of that information." But why should I think of that as time? I think the scientist can infer things like "If I were able to factor the world into variables, and draw a DAG (without determinism) that is consistent with the distribution with no spurious independencies (including in deterministic functions of the variables), and X and Y happen to be variables in that DAG, then there will be a path from X to Y." The scientist can infer that if Z is orthogonal to Y, then Z is also orthogonal to X, where this is important because Z is orthogonal to Y can be thought of as saying that Z is useless for learning about Y. (and importantly a version of useless for learning that is closed under common refinement, so if you collect a bunch of different Z orthogonal to Y, you can safely combine them, and the combination will be orthogonal to Y.) This doesn't seem to get at why we want to call it before. Hmm. Maybe I should just list a bunch of reasons why it feels like time to me (in no particular order): 1. It seems like it gets a very reasonable answer in the Game of Life example 2. Prior to this theory, I thought that it made sense to think of time as a closure property on orthogonality, and this definition of time is exactly that closure property on orthogonality, where X is weakly before Y if whenever Z is orthogonal to Y, Z is also orthogonal to X. (where the definition of orthogonality is justified with the fundamental theorem.) 3. If Y is a refinement of X, then Y cannot be strictly before X. (I notice that I don't have a thing to say about why this feels like time to me, and indeed it feels like it is in direct

I feel that interpreting "strictly before" as causality is making me more confused.

For example, here's a scenario with a randomly changed message. Bob peeks at ten regular envelopes and a special envelope that gives him a random boolean. Then Bob tells Alice the contents of either the first three envelopes or the second three, depending on the boolean. Now Alice's knowledge depends on six out of ten regular envelopes and the special one, so it's still "strictly before" Bob's knowledge. And since Alice's knowledge can be computed from Bob's knowledge but no...

3Scott Garrabrant2y
I partially agree, which is partially why I am saying time rather than causality. I still feel like there is an ontological disagreement in that it feels like you are objecting to saying the physical thing that is Alice's knowledge is (not) before the physical thing that is Bob's knowledge. In my ontology: 1) the information content of Alice's knowledge is before the information content of Bob's knowledge. (I am curious if this part is controversial.) and then, 2) there is in some sense no more to say about the physical thing that is e.g. Alice's knowledge beyond the information content. So, I am not just saying Alice is before Bob, I am also saying e.g. Alice is before Alice+Bob, and I can't disentangle these statements because Alice+Bob=Bob. I am not sure what to say about the second example. I am somewhat rejecting the dynamics. "Alice travels back in time" is another way of saying that the high level FFS time disagrees with the standard physical time, which is true. The "high level" here is pointing to the fact that we are only looking at the part of Alice's brain that is about the envelopes, and thus talking about coarser variables than e.g. Alice's entire brain state in physical time. And if we are in the ontology where we are only looking at the information content, taking a high level version of a variable is the kind of thing that can change its temporal properties, since you get an entirely new variable. I suspect most of the disagreement is in the sort of "variable nonrealism" of reducing the physical thing that is Alice's knowledge to its information content?

I think the definition of history is the most natural way to recover something like causal structure in these models.

I'm not sure how much it's about causality. Imagine there's a bunch of envelopes with numbers inside, and one of the following happens:

1. Alice peeks at three envelopes. Bob peeks at ten, which include Alice's three.

2. Alice peeks at three envelopes and tells the results to Bob, who then peeks at seven more.

3. Bob peeks at ten envelopes, then tells Alice the contents of three of them.

Under the FFS definition, Alice's knowledge in each ...

Agree it's not totally right to call this a causal relationship.

That said:

• The contents of 3 envelopes does seem causally upstream of the contents of 10 envelopes
• If Alice's perception is imperfect (in any possible world), then "what Alice perceived" is not identical to "the contents of 3 envelopes" and so is not strictly before "what Bob perceived" (unless there is some other relationship between them).
• If Alice's perception is perfect in every possible world, then there is no possible way to intervene on Alice's perception without intervening on the conten
...

Can you give some more examples to motivate your method? Like the smoking/tar/cancer example for Pearl's causality, or Newcomb's problem and counterfactual mugging for UDT.

Hmm, first I want to point out that the talk here sort of has natural boundaries around inference, but I also want to work in a larger frame that uses FFS for stuff other than inference.

If I focus on the inference question, one of the natural questions that I answer is where I talk about grue/bleen in the talk.

I think for inference, it makes the most sense to think about FFS relative to Pearl. We have this problem with looking at smoking/tar/cancer, which is what if we carved into variables the wrong way. What if instead of tar/cancer, we had a varia...

Well, imagine we have three boolean random variables. In "general position" there are no independence relations between them, so we can't say much. Constrain them so two of the variables are independent, that's a bit less "general", and we still can't say much. Constrain some more so the xor of all three variables is always 1, that's even less "general", now we can use your method to figure out that the third variable is downstream of the first two. Constrain some more so that some of the probabilities are 1/2, and the method stops working. What I'd like to understand is the intuition, which real world cases have the particular "general position" where the method works.

3Scott Garrabrant2y
Ok, makes sense. I think you are just pointing out that when I am saying "general position," that is relative to a given structure, like FFS or DAG or symmetric FFS. If you have a probability distribution, it might be well modeled by a DAG, or a weaker condition is that it is well modeled by a FFS, or an even weaker condition is that it is well modeled by a SFFS (symmetric finite factored set). We have a version of the fundamental theorem for DAGs and d-separation, we have a version of the fundamental theorem for FFS and conditional orthogonality, and we might get a version of the fundamental theorem for SFFS and whatever corresponds to conditional independence in that world. However, I claim that even if we can extend to a fundamental theorem for SFFS, I still want to think of the independences in a SFFS as having different sources. There are the independences coming from orthogonality, and there are the independences coming from symmetry (or symmetry together with orthogonality). In this world, orthogonality won't be as inferable because it will only be a subset of independence, but it will still be an important concept. This is similar to what I think will happen when we go to the infinite dimensional factored sets case.

0acgt2y
I’m confused what necessary work the Factorisation is doing in these temporal examples - in your example A and B are independent and C is related to both - the only assignment of “upstream/downstream” relations that makes sense is that C is downstream of both. Is the idea that factorisation is what carves your massive set of possible worlds up into these variables in the first place? Feel like I’m in a weird position where the math makes sense but I’m missing the motivational intuition for why we want to switch to this framework in the first place

Yeah, that's what I thought, the method works as long as certain "conspiracies" among probabilities don't happen. (1/2 is not the only problem case, it's easy to find others, but you're right that they have measure zero.)

But there's still something I don't understand. In the general position, if X is before Y, it's not always true that X is independent of X XOR Y. For example, if X = "person has a car on Monday" and Y = "person has a car on Tuesday", and it's more likely that a car-less person gets a car than the other way round, the independence doesn't hold. It requires a conspiracy too. What's the intuitive difference between "ok" and "not ok" conspiracies?

3Scott Garrabrant2y
I don't understand what conspiracy is required here. X being orthogonal to X XOR Y implies X is before Y, we don't get the converse.

And if X is independent of X XOR Y, we’re actually going to be able to conclude that X is before Y!

It's interesting to translate that to the language of probabilities. For example, your condition holds for any X,Y (possibly dependent) such that P(X)=P(Y)=1/2, but it doesn't make sense to say that X is before Y in every such pair. For a real world example, take X = "person has above median height" and Y = "person has above median age".

4Scott Garrabrant2y
So you should probably not work with probabilities equal to 1/2 in this framework, unless you are doing so for a specific reason. Just like in Pearlian causality, we are mostly talking about probabilities in general position. I have some ideas about how to deal with probability 1/2 (Have a FFS, together with a group of symmetry constraints, which could swap factors, or swap parts within a factor), but that is outside of the scope of what I am doing here. To give more detail, the uniform distribution on four elements does not satisfy the compositional semigraphoid axioms, since if we take X, Y, Z to be the three partitions into two parts of size two, X is independent with Y and X is independent with Z, but X is not independent with the common refinement of Y and Z. Thus, if we take the orthogonality database generated by this probability distribution, you will find that it is not satisfied by any models.
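Both probability-1/2 claims in this exchange can be checked by enumeration. The sketch below is illustrative only: the helper `independent`, the variable encodings, and the choice a = 1/8 for the dependent pair are assumptions made for the example.

```python
from fractions import Fraction
from itertools import product

def independent(f, g, prob):
    """True if the random variables f and g are exactly independent under prob."""
    f_vals = {f(w) for w in prob}
    g_vals = {g(w) for w in prob}
    for a, b in product(f_vals, g_vals):
        p_a = sum(p for w, p in prob.items() if f(w) == a)
        p_b = sum(p for w, p in prob.items() if g(w) == b)
        p_ab = sum(p for w, p in prob.items() if f(w) == a and g(w) == b)
        if p_ab != p_a * p_b:
            return False
    return True

# Claim 1: a dependent pair (X, Y) with P(X) = P(Y) = 1/2 still has X
# independent of X xor Y. Outcomes are (x, y) pairs; a = P(X=1, Y=1) is assumed.
a = Fraction(1, 8)
half = Fraction(1, 2)
pair = {(1, 1): a, (1, 0): half - a, (0, 1): half - a, (0, 0): a}

def x_of(s): return s[0]
def y_of(s): return s[1]
def xor_of(s): return s[0] ^ s[1]

print(independent(x_of, y_of, pair))    # False: X and Y are dependent
print(independent(x_of, xor_of, pair))  # True: X is independent of X xor Y

# Claim 2: uniform distribution on four elements, with X, Y, Z the three
# partitions into two parts of size two (encoded via the two bits of w).
uniform4 = {w: Fraction(1, 4) for w in range(4)}

def X(w): return w >> 1              # {0,1} | {2,3}
def Y(w): return w & 1               # {0,2} | {1,3}
def Z(w): return (w >> 1) ^ (w & 1)  # {0,3} | {1,2}
def YZ(w): return (Y(w), Z(w))       # common refinement of Y and Z

print(independent(X, Y, uniform4))   # True
print(independent(X, Z, uniform4))   # True
print(independent(X, YZ, uniform4))  # False: composition fails
```

The last line is the failure of the composition axiom described above: Y and Z together pin down the element, and therefore X.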

Thank you! It looks very impressive.

Has anyone tried to get it to talk itself out of the box yet?

1Yoav Ravid3y
Yup, I saw an attempt [https://www.reddit.com/r/slatestarcodex/comments/hua8e0/trying_an_ai_box_experiment_with_gpt3/fylzgk3/] on the SSC subreddit

I see. In that case does the procedure for defining points stay the same, or do you need to use recursively enumerable sets of opens, giving you only countably many reals?

1Jessica Taylor3y
Reals are still defined as sets of (a, b) rational intervals. The locale contains countable unions of these, but all these are determined by which (a, b) intervals contain the real number.

Wait, but rational-delimited open intervals don't form a locale, because they aren't closed under infinite union. (For example, the union of all rational-delimited open intervals contained in (0,√2) is (0,√2) itself, which is not rational-delimited.) Of course you could talk about the locale generated by such intervals, but then it contains all open intervals and is uncountable, defeating your main point about going from countable to uncountable. Or am I missing something?

2Jessica Taylor3y
Good point; I've changed the wording to make it clear that the rational-delimited open intervals are the basis, not all the locale elements. Luckily, points can be defined as sets of basis elements containing them, since all other properties follow. (Making the locale itself countable requires weakening the definition so that unions are only taken over a countable collection of sets, e.g. by requiring those sets to be recursively enumerable.)

I'm actually not sure it's a regular grammar. Consider this program:

f(n) := n+f(n-1)


Which gives the tree

n+(n-1)+((n-1)-1)+...


The path from any 1 to the root contains a bunch of minuses, then at least as many pluses. That's not regular.

So it's probably some other kind of grammar, and I don't know if it has decidable equivalence.
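
To make the path claim concrete, here is a rough sketch that unrolls f(n) := n+f(n-1) a few levels and reads off the operators on the way from each 1 up to the root. The nested-tuple encoding of the tree and the names `shifted`, `unroll` and `paths_from_ones` are my own assumptions for illustration.

```python
def shifted(k):
    """The argument after k recursive calls: n-1-...-1 with k subtractions."""
    expr = "n"
    for _ in range(k):
        expr = ("-", expr, "1")
    return expr

def unroll(depth):
    """Unroll f(n) := n + f(n-1) `depth` times; "f" marks the unexpanded tail."""
    tree = "f"
    for k in reversed(range(depth)):
        tree = ("+", shifted(k), tree)
    return tree

def paths_from_ones(tree, above=""):
    """For each leaf "1", yield the operators met walking from it up to the root."""
    if tree == "1":
        yield above
    elif isinstance(tree, tuple):
        op, left, right = tree
        # The child's path starts with its parent's operator.
        yield from paths_from_ones(left, op + above)
        yield from paths_from_ones(right, op + above)

for word in sorted(paths_from_ones(unroll(4)), key=len):
    print(word)
```

In this encoding every word is a block of minuses followed by a strictly longer block of pluses, and recognizing that shape requires comparing two unbounded counts, which is what a finite automaton cannot do.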

Ok, if we disallow cycles of outermost function calls, then it seems the trees are indeed infinite only in one direction. Here's a half-baked idea then: 1) interpret every path from node to root as a finite word 2) interpret the tree as a grammar for recognizing these words 3) figure out if equivalence of two such grammars is decidable. For example, if each tree corresponds to a regular grammar, then you're in luck because equivalence of regular grammars is decidable. Does that make sense?
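
For step 3 in the regular case, one standard way to decide equivalence is to search the product of two automata for a state pair that disagrees on acceptance. This is only a sketch under the assumption that each grammar has already been turned into a complete DFA; the `(start, accepting, transitions)` representation and the example automata are hypothetical.

```python
from collections import deque

def dfa_equivalent(dfa_a, dfa_b, alphabet):
    """Decide whether two complete DFAs accept the same language.

    Each DFA is (start, accepting_states, transitions), where
    transitions[(state, symbol)] gives the next state.
    """
    start_a, accept_a, delta_a = dfa_a
    start_b, accept_b, delta_b = dfa_b
    seen = {(start_a, start_b)}
    queue = deque(seen)
    while queue:
        p, q = queue.popleft()
        # A reachable pair that disagrees on acceptance is a distinguishing word.
        if (p in accept_a) != (q in accept_b):
            return False
        for symbol in alphabet:
            nxt = (delta_a[p, symbol], delta_b[q, symbol])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True

# Example: "even number of a's", once with 2 states and once with 4.
even_a = (0, {0}, {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1})
even_a_big = (
    0,
    {0, 2},
    {**{(i, "a"): (i + 1) % 4 for i in range(4)}, **{(i, "b"): i for i in range(4)}},
)
odd_a = (0, {1}, even_a[2])

print(dfa_equivalent(even_a, even_a_big, "ab"))
print(dfa_equivalent(even_a, odd_a, "ab"))
```

The search visits at most one pair per combination of states, so it always terminates.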

1johnswentworth3y
Yeah, that makes sense. And off the top of my head, it seems like they would indeed be regular grammars - each node in the tree would be a state in the finite state machine, and then copies of the tree would produce loops in the state transition graph. Symbols on the edges would be the argument names (or indices) for the inputs to atomic operations. Still a few i's to dot and t's to cross, but I think it works. Elegant, too. Nice solution!

Then isn't it possible to also have infinite expansions "in the middle", not only "inside" and "outside"? Something like this:

f(n) := f(g(n))
g(n) := g(n+1)


Maybe there's even some way to have infinite towers of infinite expansions. I'm having trouble wrapping my head around this.

1johnswentworth3y
Yup, that's right. I tentatively think it's ok to just ignore cases with "outside" infinities. Examples like f(n) = f(n+1) should be easy to detect, and presumably it would never show up in a program which halts. I think programs which halt would only have "inside" infinities (although some non-halting programs would also have inside infinities), and programs with non-inside infinities should be detectable - i.e. recursive definitions of a function shouldn't have the function itself as the outermost operation. Still not sure - I could easily be missing something crucial - but the whole problem feels circumventable. Intuitively, Turing completeness only requires infinity in one time-like direction; inside infinities should suffice, so syntactic restrictions should be able to eliminate the other infinities.

I don't understand why the second looks like that, can you explain?

1johnswentworth3y
Oh, I made a mistake. I guess they would look like ...((((((((...))))))))... and ...(((((...) + 1) + 1) + 1) + 1)..., respectively. Thanks for the examples, that's helpful - good examples where the fixed point of expansion is infinite "on the outside" as well as "inside". Was that the confusion? Another possible point of confusion is why the "+ 1"s are in the expression tree; the answer is that addition is usually an atomic operator of a language. It's not defined in terms of other things; we can't/don't beta-reduce it. If it were defined in terms of other things, I'd expand it, and then the expression tree would look more complicated.

Not sure I understand the question. Consider these two programs:

1. f(n) := f(n)

2. f(n) := f(n+1)

Which expression trees do they correspond to? Are these trees equivalent?

1johnswentworth3y
The first would generate a stick: ((((((((...)))))))). The second would generate: (((((...) + 1) + 1) + 1) + 1). These are not equivalent. Does that make sense?

I just thought of a simple way to explain tensors. Imagine a linear function that accepts two numbers and returns a number, let's call it f(x,y). Except there are two ways to imagine it:

1. Linear in both arguments combined: f(1,2)+f(1,3)=f(2,5). Every such function has the form f(x,y)=ax+by for some a and b, so the space of such functions is 2-dimensional. We say that the Cartesian product of R^1 and R^1 is R^2, because 1+1=2.

2. Linear in each argument when the other is fixed: f(1,2)+f(1,3)=f(1,5). Every such function has the form f(x,y)=axy for some a, so

...
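
The two notions of linearity in the list above can be checked numerically. A small sketch, with arbitrarily chosen coefficients and hypothetical helper names:

```python
def jointly_linear(a, b):
    """f(x, y) = a*x + b*y: linear in the pair (x, y) taken as a whole."""
    return lambda x, y: a * x + b * y

def bilinear(a):
    """f(x, y) = a*x*y: linear in each argument when the other is fixed."""
    return lambda x, y: a * x * y

f = jointly_linear(2, 3)   # coefficients chosen arbitrarily
g = bilinear(5)

# Case 1: adding inputs adds both coordinates.
print(f(1, 2) + f(1, 3) == f(2, 5))
# Case 2: with x fixed at 1, only the second coordinate adds.
print(g(1, 2) + g(1, 3) == g(1, 5))
# The bilinear function does not satisfy the first identity.
print(g(1, 2) + g(1, 3) == g(2, 5))
```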

Your arbitration oracle seems equivalent to the consistent guessing problem described by Scott Aaronson here. Also see the comment from Andy D proving that it's indeed strictly simpler than the halting problem.

I think your argument will also work for PA and many other theories. It's known as game semantics:

The simplest application of game semantics is to propositional logic. Each formula of this language is interpreted as a game between two players, known as the "Verifier" and the "Falsifier". The Verifier is given "ownership" of all the disjunctions in the formula, and the Falsifier is likewise given ownership of all the conjunctions. Each move of the game consists of allowing the owner of the dominant connective to pick one of its branches; play will then co

...
1Jessica Taylor3y
Indeed, a constructive halting oracle can be thought of as a black-box that takes a PA statement, chooses whether to play Verifier or Falsifier, and then plays that, letting the user play the other part. Thanks for making this connection.

To me the problem of embedded agency isn't about fitting a large description of the world into a small part of the world. That's easy with quining, which is mentioned in the MIRI writeup. The problem is more about the weird consequences of learning about something that contains the learner.

Also, I love your wording that the problem has many faucets. Please don't edit it out :-)

2shminux3y
haha, oops.

Edit: no point asking this question here.

I see, thanks, that makes it clearer. There's no disagreement, you're trying to justify the approach that people are already using. Sorry about the noise.

1Chris_Leong3y
Not at all. Your comments helped me realise that I needed to make some edits to my post.

Well, the program is my formalization. All the premises are right there. You should be able to point out where you disagree.

1Chris_Leong3y
In other words, the claim isn't that your program is incorrect, it's that it requires more justification than you might think in order to persuasively show that it correctly represents Newcomb's problem. Maybe you think understanding this isn't particularly important, but I think knowing exactly what is going on is key to understanding how to construct logical-counterfactuals in general.
1Chris_Leong3y
I actually don't know Haskell, but I'll take a stab at decoding it tonight or tomorrow. Open-box Newcomb's is normally stated as "you see a full box", not "you or a simulation of you sees a full box". I agree with this reinterpretation, but I disagree with glossing it over. My point was that if we take the problem description super-literally as you seeing the box and not a simulation of you, then you must one-box. Of course, since this provides a trivial decision problem, we'll want to reinterpret it in some way and that's what I'm providing a justification for.

I couldn't understand your comment, so I wrote a small Haskell program to show that two-boxing in the transparent Newcomb problem is a consistent outcome. What parts of it do you disagree with?

1Chris_Leong3y
Okay, I have to admit that that's kind of cool; but on the other hand, that also completely misses the point. I think we need to backtrack. A maths proof can be valid but the conclusion false if at least one premise is false, right? So unless a problem has already been formally defined, it's not enough to just throw down a maths proof; you also have to justify that you've formalised it correctly.

If you see a full box, then you must be going to one-box if the predictor really is perfect.

Huh? If I'm a two-boxer, the predictor can still make a simulation of me, show it a simulated full box, and see what happens. It's easy to formalize, with computer programs for the agent and the predictor.
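
A rough sketch of what such a formalization could look like, with the agent as a function from what it sees to what it does, and the predictor calling that function on a simulated full box. The names and the payoff amounts (the usual $1,000,000 and $1,000) are assumptions for illustration.

```python
FULL, EMPTY = "full", "empty"
ONE_BOX, TWO_BOX = "one-box", "two-box"

def two_boxer(observation):
    """Takes both boxes no matter what it sees."""
    return TWO_BOX

def one_boxer(observation):
    """Takes only the big box no matter what it sees."""
    return ONE_BOX

def predictor(agent):
    """Fill the big box iff a simulated agent, shown a full box, one-boxes."""
    return FULL if agent(FULL) == ONE_BOX else EMPTY

def play(agent):
    """Run the whole problem and return (box state, action, payoff)."""
    box = predictor(agent)
    action = agent(box)
    payoff = (1_000_000 if box == FULL else 0) + (1_000 if action == TWO_BOX else 0)
    return box, action, payoff

print(play(two_boxer))
print(play(one_boxer))
```

In this sketch the two-boxer is shown a full box only inside the predictor's simulation; the real agent sees an empty box, and nothing in the setup is contradictory.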

1Chris_Leong3y
I've already addressed this in the article above, but my understanding is as follows: This is one of those circumstances where it is important to differentiate between you being in a situation and a simulation of you being in a situation. I really should write a post about this - but in order for a simulation to be accurate, it simply has to make the same decisions in decision theory problems. It doesn't have to have anything else the same - in fact, it could be an anti-rational agent with the opposite utility function. Note that I'm not claiming that an agent can ever tell whether it is in the real world or in a simulation, but that's not the point. I'm adopting the viewpoint of an external observer who can tell the difference. I think the key here is to think about what is happening both in terms of philosophy and mathematics, but you only seem interested in the former?