Wiki Contributions


[Without having looked at the link in your response to my other comment, and I also stopped reading cubefox's comment once it seemed that it was going in a similar direction. ETA: I realized after posting that I have seen that article before, but not recently.]

I'll assume that the robot has a special "memory" sensor which stores the exact experience at the time of the previous tick. It will recognize future versions of itself by looking for agents in its (timeless) 0P model which has a memory of its current experience.

For p("I will see O"), the robot will look in its 0P model for observers which have the t=0 experience in their immediate memory, and selecting from those, how many have judged "I see O" as Here. There will be two such robots, the original and the copy at time 1, and only one of those sees O. So using a uniform prior (not forced by this framework), it would give a 0P probability of 1/2. Similarly for p("I will see C").

Then it would repeat the same process for t=1 and the copy. Conditioned on "I will see C" at t=1, it will conclude "I will see CO" with probability 1/2 by the same reasoning as above. So overall, it will assign: p("I will see OO") = 1/2, p("I will see CO") = 1/4, p("I will see CC") = 1/4

The semantics for these kinds of things is a bit confusing. I think that it starts from an experience (the experience at t=0) which I'll call E. Then REALIZATION(E) casts E into a 0P sentence which gets taken as an axiom in the robot's 0P theory.

A different robot could carry out the same reasoning, and reach the same conclusion since this is happening on the 0P side. But the semantics are not quite the same, since the REALIZATION(E) axiom is arbitrary to a different robot, and thus the reasoning doesn't mean "I will see X" but instead means something more like "They will see X". This suggests that there's a more complex semantics that allows worlds and experiences to be combined - I need to think more about this to be sure what's going on. Thus far, I still feel confident that the 0P/1P distinction is more fundamental than whatever the more complex semantics is.

(I call the 0P -> 1P conversion SENSATIONS, and the 1P -> 0P conversion REALIZATION, and think of them as being adjoints though I haven't formalized this part well enough to feel confident that this is a good way to describe it: there's a toy example here if you are interested in seeing how this might work.)

That's a very good question! It's definitely more complicated once you start including other observers (including future selves), and I don't feel that I understand this as well.

But I think it works like this: other reasoners are modeled (0P) as using this same framework. The 0P model can then make predictions about the 1P judgements of these other reasoners. For something like anticipation, I think it will have to use memories of experiences (which are also experiences) and identify observers for which this memory corresponds to the current experience. Understanding this better would require being more precise about the interplay between 0P and 1P, I think.

(I'll examine your puzzle when I have some time to think about it properly)

Because you don't necessarily know which agent you are. If you could always point to yourself in the world uniquely, then sure, you wouldn't need 1P-Logic. But in real life, all the information you learn about the world comes through your sensors. This is inherently ambiguous, since there's no law that guarantees your sensor values are unique.

If you use X as a placeholder, the statement sensor_observes(X, red) can't be judged as True or False unless you bind X to a quantifier. And this could not mean the thing you want it to mean (all robots would agree on the judgement, thus rendering it useless for distinguishing itself amongst them).

It almost works though, you just have to interpret "True" and "False" a bit differently!

Strong encouragement to write about (1)!

Alright, to check if I understand, would these be the sorts of things that your model is surprised by?

  1. An LLM solves a mathematical problem by introducing a novel definition which humans can interpret as a compelling and useful concept.
  2. An LLM which can be introduced to a wide variety of new concepts not in its training data, and after a few examples and/or clarifying questions is able to correctly use the concept to reason about something.
  3. A image diffusion model which is shown to have a detailed understanding of anatomy and 3D space, such that you can use it to transform an photo of a person into an image of the same person in a novel pose (not in its training data) and angle with correct proportions and realistic joint angles for the person in the input photo.

Is there a specific thing you think LLMs won't be able to do soon, such that you would make a substantial update toward shorter timelines if there was an LLM able to do it within 3 years from now?

That... seems like a big part of what having "solved alignment" would mean, given that you have AGI-level optimization aimed at (indirectly via a counter-factual) evaluating this (IIUC).

Nice graphic!

What stops e.g. "QACI(expensive_computation())" from being an optimization process which ends up trying to "hack its way out" into the real QACI?


For the poset example, I'm using Chu spaces with only 2 colors. I'm also not thinking of the rows or columns of a Chu space as having an ordering (they're sets), you can rearrange them as you please and have a Chu space representing the same structure.

I would suggest reading through to the ## There and Back Again section and in particular while trying to understand how the other poset examples work, and see if that helps the idea click. And/or you can suggest another coloring you think should be possible, and I can tell you what it represents.

I'm not sure if I can find it easily, but I recall Eliezer pointing out (several years ago) that he thought that Value Identification was the "easy part" of the alignment problem, with the getting it to care part being something like an order of magnitude more difficult. He seemed to think (IIRC) this itself could still be somewhat difficult, as you point out. Additionally, the difficulty was always considered in the context of having an alignable AGI (i.e. something you can point in a specific direction), which GPT-N is not under this paradigm.

Load More