Tsvi Benson-Tilsen

An important thing that the AGI alignment field never understood:

Reflective stability. Everyone thinks it's about, like, getting guarantees, or something. Or about rationality and optimality and decision theory, or something. Or about how we should understand ideal agency, or something.

But what I think people haven't understood is

  1. If a mind is highly capable, it has a source of knowledge.
  2. The source of knowledge involves deep change.
  3. Lots of deep change implies lots of strong forces (goal-pursuits) operating on everything.
  4. If there's lots of strong goal-pursuits operating on everything, nothing (properties, architectures, constraints, data formats, conceptual schemes, ...) sticks around unless it has to stick around.
  5. So if you want something to stick around (such as the property "this machine doesn't kill all humans"), you have to know what sort of thing can stick around / what sort of context makes things stick around even when there are strong goal-pursuits around. This is a specific thing to know, because most things don't stick around.
  6. The elements that stick around and help determine the mind's goal-pursuits have to do so in a way that positively makes them stick around (reflective stability of goals).

There are exceptions, nuances, and possible escape routes. And the older Yudkowsky-led research about decision theory and tiling and reflective probability is relevant. But this basic argument is in some sense simpler (less advanced, but also more radical ("at the root")) than those essays. The response to the failure of those essays can't just be to "try something else about alignment"; the basic problem is still there and has to be addressed.

(related elaboration: https://tsvibt.blogspot.com/2023/01/a-strong-mind-continues-its-trajectory.html https://tsvibt.blogspot.com/2023/01/the-voyage-of-novelty.html )

Probabilities on summary events like this are mostly pretty pointless. You're throwing together a bunch of different questions, about which you have very different knowledge states (including how much and how often you should update about them).

IDK if this is a crux for me thinking this is very relevant to stuff from my perspective, but:

The training procedure you propose doesn't seem to actually incentivize indifference. First, a toy model where I agree it does incentivize that:

On the first time step, the agent gets a choice: choose a number 1--N. If the agent says k, then the agent has nothing at all to do for the first k steps, after which some game G starts. (Each play of G is i.i.d., not related to k.)

So this agent is indeed incentivized to pick k uniformly at random from 1--N. Now consider:

The agent is in a rich world. There are many complex multi-step plans available, to incentivize the agent to learn problem-solving. Each episode, at time N, the agent gets to choose: end now, or play 10 more steps.

Does this incentivize random choice at time N? No. It incentivizes the agent to choose randomly End or Continue at the very beginning of the episode, and then carefully plan and execute behavior that achieves the most reward assuming a run of length N or N+10, respectively.
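A minimal sketch of this incentive, under made-up assumptions (the values N=5 and the rule that a multi-step plan pays off only if the episode lasts long enough for it to finish are my illustrative choices, not part of the original proposal):

```python
def reward(plan_length, episode_length):
    # Assumption: a plan committed to at step 0 pays its full length if the
    # episode lasts long enough for it to finish, and nothing if it is cut off.
    return plan_length if plan_length <= episode_length else 0

N, BONUS = 5, 10                # time of the End/Continue choice; extra steps
lengths = [N, N + BONUS]        # the 50/50 coin makes each length equally likely

# Policy A: genuinely undecided until time N. The plan chosen at step 0
# cannot depend on the coin, so the agent picks the single best fixed plan.
ev_undecided = max(
    sum(reward(plan, ep) for ep in lengths) / 2 for plan in lengths
)

# Policy B: flip the coin (internally) at step 0, then plan for that exact length.
ev_committed = sum(reward(ep, ep) for ep in lengths) / 2

print(ev_undecided, ev_committed)  # 7.5 10.0
```

Since committing to End/Continue at step 0 strictly beats staying undecided, training pressure favors the agent that fixes the choice early and plans around it.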

Wait, but isn't this success? Didn't we make the agent have no trajectory length preference?

No. Suppose:

Same as before, but now there's a little guy standing by the End/Continue button. Sometimes he likes to press the button at random.

Do we kill the guy? Yes, we certainly do; he will mess up our careful plans.

I'm not sure I understand your question at all, sorry. I'll say my interpretation and then answer that. You might be asking:

Is the point of the essay summed up by saying: " "Thing=Nexus" is not mechanistic/physicalist, but it's still useful; in general, explanations can be non-mechanistic etc., but still be useful, perhaps by giving a functional definition of something."?

My answer is no, that doesn't sum up the essay. The essay makes these claims:

  1. There are many different directions in conceptspace that could be considered "more foundational", each with their own usefulness and partial coherence.
  2. None of these directions gives a total ordering that satisfies all the main needs of a "foundational direction".
  3. Some propositions/concepts not only fail to go in [your favorite foundational direction], but are furthermore circular; they call on themselves.
  4. At least for all the "foundational directions" I listed, circular ideas can't be going in that direction, because they are circular.
  5. Nevertheless, a circular idea can be pretty useful.

I did fail to list "functional" in my list of "foundational directions", so thanks for bringing it up. What I say about foundational directions would also apply to "functional".

As you can see, the failures lie on a spectrum, and they're model-dependent to boot.

And we can go further and say that the failures lie in a high-dimensional space, and that the apparent tradeoff is more a matter of finding the directions in which to pull the rope sideways. Propagating constraints between concepts and propositions is a way to go that seems hopeworthy to me. One wants to notice commonalities in how each of one's plans is doomed, and then address the common blockers / missing ideas. In other words, recurse to the "abstract" as much as is called for, even if you get really abstract; but treat [abstracting more than what you can directly see/feel as being demanded by your thinking] as a risky investment with opportunity cost.

Thanks for writing this and engaging in the comments. "Humans/humanity offer the only real GI data, so far" is a basic piece of my worldview and it's nice to have a reference post explaining something like that.

I roughly agree. As I mentioned to Adele, I think you could get sort of lame edge cases where the LLM kinda helped find a new concept. The thing that would make me think the end is substantially nigher is if you get a model that's making new concepts of comparable quality at a comparable rate to a human scientist in a domain in need of concepts.

if you nail some Chris Olah style transparency work

Yeah that seems right. I'm not sure what you mean by "about language". Sorta plausibly you could learn a little something new about some non-language domain that the LLM has seen a bunch of data about, if you got interpretability going pretty well. In other words, I would guess that LLMs already do lots of interesting compression in a different way than humans do it, and maybe you could extract some of that. My quasi-prediction would be that those concepts

  1. are created using way more data than humans use for many of their important concepts; and
  2. are weirdly flat, and aren't suitable out of the box for a big swath of the things that human concepts are suitable for.

Ok yeah I agree with this. Related: https://tsvibt.blogspot.com/2023/09/the-cosmopolitan-leviathan-enthymeme.html#pointing-at-reality-through-novelty

And an excerpt from a work in progress:

Example: Blueberries

For example, I reach out and pick up some blueberries. This is some kind of expression of my values, but how so? Where are the values?

Are the values in my hands? Are they entirely in my hands, or not at all in my hands? The circuits that control my hands do what they do with regard to blueberries by virtue of my hands being the way they are. If my hands were different, e.g. really small or polydactylous, my hand-controller circuits would be different and would behave differently when getting blueberries. And the deeper circuits that coordinate visual recognition of blueberries, and the deeper circuits that coordinate the whole blueberry-getting system and correct errors based on blueberrywise success or failure, would also be different. Are the values in my visual cortex? The deeper circuits require some interface with my visual cortex, to do blueberry find-and-pick-upping. And having served that role, my visual cortex is specially trained for that task, and it will even promote blueberries in my visual field to my attention more readily than yours will to you. And my spatial memory has a nearest-blueberries slot, like those people who always know which direction is north.

It may be objected that the proximal hand-controllers and the blueberry visual circuits are downstream of other deeper circuits, and since they are downstream, they can be excluded from constituting the value. But that's not so clear. To like blueberries, I have to know what blueberries are, and to know what blueberries are I have to interact with them. The fact that I value blueberries relies on me being able to refer to blueberries. Certainly, if my hands were different but comparably versatile, then I would learn to use them to refer to blueberries about as well as my real hands do. But the reference to (and hence the value of) blueberries must pass through something playing the role that hands play. The hands, or something else, must play that role in constituting the fact that I value blueberries.

The concrete is never lost

In general, values are founded on reference. The context that makes a value a value has to provide reference.

The situation is like how an abstract concept, once gained, doesn't overwrite and obsolete what was abstracted from. Maxwell's equations don't annihilate Faraday's experiments in their detail. The experiments are unified in idea--metaphorically, the field structures are a "cross-section" of the messy detailed structure of any given experiment. The abstract concepts, to say something about a specific concrete experimental situation, have to be paired with specific concrete calculations and referential connections. The concrete situations are still there, even if we now, with our new abstract concepts, want to describe them differently.

Is reference essentially diasystemic?

If so, then values are essentially diasystemic.

Reference goes through unfolding.

To refer to something in reality is to be brought (or rather, bringable) to the thing. To be brought to a thing is to go to where the thing really is, through whatever medium is between the mind and where the thing really is. The "really is" calls on future novelty. See "pointing at reality through novelty".

In other words, reference is open--maybe radically open. It's supposed to incorporate whatever novelty the mind encounters--maybe deeply.

An open element can't be strongly endosystemic.

An open element will potentially relate to (radical, diasystemic) novelty, so its way of relating to other elements can't be fully stereotyped by preexisting elements with their preexisting manifest relations.

It's definitely like symbol grounding, though symbol grounding is usually IIUC about "giving meaning to symbols", which I think has the emphasis on epistemic signifying?

I think that your question points out how the concepts as I've laid them out don't really work. I now think that values such as liking a certain process or liking mental properties should be treated as first-class values, and this pretty firmly blurs the telopheme / telophore distinction.
