johnswentworth

Comments

Optimization at a Distance

Exactly! That's an optimization-at-a-distance style intuition. The optimizer (e.g. human) optimizes things outside of itself, at some distance from itself.

A rock can arguably be interpreted as optimizing itself, but that's not an interesting kind of "optimization", and the rock doesn't optimize anything outside itself. Throw it in a room, the room stays basically the same.

[Intro to brain-like-AGI safety] 14. Controlled AGI

I thought delegation-to-GPT-N was a central part of the story: i.e., maybe GPT-N knew that the designs could be used for bombs, but it didn't care to tell the human, because the human didn't ask. But from what you're saying now, I guess GPT-N has nothing to do with the story?

Basically, yeah.

The important point (for current purposes) is that, as the things-the-system-is-capable-of-doing-or-building scale up, we want the system's ability to notice subtle problems to scale up with it. If the system is capable of designing complex machines way outside what humans know how to reason about, then we need similarly-superhuman reasoning about whether those machines will actually do what a human intends. "With great power comes great responsibility" - cheesy, but it fits.

[Intro to brain-like-AGI safety] 14. Controlled AGI

An example might be helpful here: consider the fusion power generator scenario. In that scenario, a human thinking about what they want arrives at the wrong answer, not because of uncertainty about their own values, but because they don't think to ask the right questions about how the world works. That's the sort of thing I have in mind.

In order to handle that sort of problem, an AI has to be able to use human values somehow without carrying over other specifics of how a human would reason about the situation.

I don’t think “the human is deciding whether or not she cares about Ems” is a different set of mental activities from “the human is trying to make sense of a confusing topic”, or “the human is trying to prove a theorem”, etc.

I think I disagree with this claim. Maybe not exactly as worded - like, sure, maybe the "set of mental activities" involved in the reasoning overlap heavily. But I do expect (weakly, not confidently) that there's a natural notion of human-value-generator which factors from the rest of human reasoning, and has a non-human-specific API (e.g. it interfaces with natural abstractions).
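
To gesture at what I mean by a non-human-specific API, here's a minimal sketch (all the names and the toy internals are hypothetical illustrations, not a proposal for how the value-generator actually works); the point is that an outer system could query values over natural abstractions without emulating the rest of human reasoning:

```python
from typing import Mapping, Protocol

# Hypothetical names for illustration only. A "world state" here is just a
# map from natural abstractions to summary values; nothing human-specific.
WorldState = Mapping[str, float]

class ValueGenerator(Protocol):
    """The factored-out piece: consumes natural abstractions and returns a
    valuation, without reference to the rest of human reasoning."""
    def value(self, state: WorldState) -> float: ...

class ToyHumanValues:
    """Toy stand-in for the human value-generator; the interface is the
    point, not these internals."""
    def __init__(self, weights: Mapping[str, float]):
        self.weights = dict(weights)

    def value(self, state: WorldState) -> float:
        return sum(self.weights.get(k, 0.0) * v for k, v in state.items())

# An outer system queries values over abstractions it also understands,
# without emulating any other part of human cognition.
values: ValueGenerator = ToyHumanValues({"honesty": 2.0, "em_welfare": 1.0})
print(values.value({"honesty": 0.9, "em_welfare": 0.5}))
```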

So from my perspective, what you said sounds like “Write code for a Social-Instinct AGI, and then stamp the word subroutine on that code, and then make an “outer AI” with the power to ‘query’ that ‘subroutine’.”

It sounds to me like you're imagining something which emulates human reasoning to a much greater extent than I'm imagining.

[Intro to brain-like-AGI safety] 14. Controlled AGI

We don't necessarily need the AGI itself to have human-like drives, intuitions, etc. It just needs to be able to model the human reasoning algorithm well enough to figure out what values humans assign to e.g. an em.

(I expect an AI which relied heavily on human-like reasoning for things other than values would end up doing something catastrophically stupid, much as humans are prone to do.)

[Intro to brain-like-AGI safety] 14. Controlled AGI

I expect that there will be concepts the AI finds useful which humans don't already understand. But these concepts should still be of the same type as human concepts - they're still the same kind of natural abstraction. Analogy: a human who grew up in a desert tribe with little contact with the rest of the world may not have any concept of "snow", but snow is still the kind-of-thing they're capable of understanding if they're ever exposed to it. When the AI uses concepts humans don't already have, I expect them to be like that.

As long as the concepts are the type of thing humans can recognize/understand, then it should be conceptually straightforward to model how humans would reason about those concepts or value them.

[Intro to brain-like-AGI safety] 14. Controlled AGI

This part of Proof Strategy 1 is a basically-accurate description of what I'm working towards:

We try to come up with an unambiguous definition of what [things] are, and from there we can have an unambiguous language for specifying what we want to happen in the world. We also somehow translate (or constrain) the AGI’s understanding of the world into that language, and now we can prove theorems about what the AGI is trying to do.

... it's just not necessarily about objects localized in 3D space.

Also, there are several possible paths, and they don't all require unambiguous definitions of all the "things" in a human's ontology. For instance, if corrigibility turns out to be a natural "thing", that could short-circuit the need for a bunch of other rigorous concepts.

[Intro to brain-like-AGI safety] 14. Controlled AGI

Proof strategy #1 starts with the idea that we live in a three-dimensional world containing objects and so on. We try to come up with an unambiguous definition of what those objects are, and from there we can have an unambiguous language for specifying what we want to happen in the world. We also somehow translate (or constrain) the AGI’s understanding of the world into that language, and now we can prove theorems about what the AGI is trying to do.

This is my tentative understanding of what John Wentworth is trying to do via his Natural Abstraction Hypothesis research program (most recent update here), and I’ve heard ideas in this vicinity from a couple other people as well.

I’m skeptical because a 3D world of localized objects seems to be an unpromising starting point for stating and proving useful theorems about the AGI’s motivations. After all, a lot of things that we humans care about, and that the AGI needs to care about, seem difficult to describe in terms of a 3D world of localized objects—consider the notion of “honesty”, or “solar cell efficiency”, or even “daytime”. 

With my current best formalization, the "objects" in the world are not necessarily localized in 3D space. Indeed, one of the main things which makes an abstraction "natural" is that the relevant information is redundantly represented in many places in the physical world.

"Daytime" is a good example: I can measure light intensity at lots of different places in my general area, at lots of different times, and find that they all strongly correlate. The information about light intensity is redundant across all those locations: if I measure high light intensity outside my house, then I'm pretty confident that a measurement taken outside the office at the same time will also have high intensity. The latent variable representing that redundant information (as a function of time) is what we call "daytime".

[Request for Distillation] Coherence of Distributed Decisions With Different Inputs Implies Conditioning

to be paid out to anyone you think did a fine job of distilling the thing

Needing to judge submissions is the main reason I didn't offer a bounty myself. Read the distillation, and see if you yourself understand it. If "Coherence of Distributed Decisions With Different Inputs Implies Conditioning" makes sense as a description of the idea, then you've probably understood it.

If you don't understand it after reading an attempted distillation, then it wasn't distilled well enough.

[$20K in Prizes] AI Safety Arguments Competition

I'd like to complain that this project sounds epistemically absolutely awful. It's offering money for arguments explicitly optimized to be convincing (rather than true), it offers prizes only for arguments making one particular side of the case (i.e. no money for arguments that AI risk is no big deal), and to top it off it's explicitly asking for one-liners.

I understand that it is plausibly worth doing regardless, but man, it feels so wrong having this on LessWrong.

[Request for Distillation] Coherence of Distributed Decisions With Different Inputs Implies Conditioning

Short answer: about one full day.

Longer answer: normally something like this would sit in my notebook for a while, only informing my own thinking. It would get written up as a post mainly if it were adjacent to something which came up in conversation (either on LW or in person). I would have the idea in my head from the conversation, already be thinking about how best to explain it, chew on it overnight, and then if I'm itching to produce something in the morning I'd bang out the post in about 3-4 hours.

Alternative paths: I might need this idea as background for something else I'm writing up, or I might just be in a post-writing mood and not have anything more ready-to-go. In either of those cases, I'd be starting more from scratch, and it would take about a full day.
