All of TsviBT's Comments + Replies

An interesting question I don't know the answer to is if you get more cognitive empathy past the end of where human psychological development seems to stop.

Why isn't the answer obviously "yes"? What would it look like for this not to be the case? (I'm generally somewhat skeptical of descriptions like "just faster" if the faster is like multiple orders of magnitude and sure seems to result from new ideas rather than just a bigger computer.)

1G Gordon Worley III2d
So there's different notions of more here. There's more in the sense I'm thinking in that it's not clear additional levels of abstraction enable deeper understanding given enough time. If 3 really is all the more levels you need because that's how many it takes to think about any number of levels of depth (again by swapping out levels in your "abstraction registers"), additional levels end up being in the same category. And then there's more like doing things faster which makes things cheaper. I'm more skeptical of scaling than you are perhaps. I do agree that many things become cheap at scale that are too expensive to do otherwise, and that does produce a real difference. I'm doubtful in my comment of the former kind of more. The latter type seems quite likely.

So for example, say Alice runs this experiment:

Train an agent A in an environment that contains the source B of A's reward.

Alice observes that A learns to hack B. Then she solves this as follows:

Same setup, but now B punishes (outputs high loss) A when A is close to hacking B, according to a dumb tree search that sees whether it would be easy, from the state of the environment, for A to touch B's internals.

Alice observes that A doesn't hack B. The Bob looks at Alice's results and says,

"Cool. But this won't generalize to future lethal systems because it doe... (read more)

The main way you produce a treacherous turn is not by "finding the treacherous turn capabilities," it's by creating situations in which sub-human systems have the same kind of motive to engage in a treacherous turn that we think future superhuman systems might have.

When you say "motive" here, is it fair to reexpress that as: "that which determines by what method and in which directions capabilities are deployed to push the world"? If you mean something like that, then my worry here is that motives are a kind of relation involving capabilities, not somet... (read more)

2Paul Christiano12d
I think if you train AI systems to select actions that will lead to high reward, they will sometimes learn policies that behave well until they are able to overpower their overseers, at which point they will abruptly switch to the reward hacking strategy to get a lot of reward. I think there will be many similarities between this phenomenon in subhuman systems and superhuman systems. Therefore by studying and remedying the problem for weak systems overpowering weak overseers, we can learn a lot about how to identify and remedy it for stronger systems overpowering stronger overseers. I'm not exactly sure how to cash out your objection as a response to this, but I suspect it's probably a bit too galaxy-brained for my taste.

Creating in vitro examples of problems analogous to the ones that will ultimately kill us, e.g. by showing agents engaging in treacherous turns due to reward hacking or exhibiting more and more of the core features of deceptive alignment.


A central version of this seems to straightforwardly advance capabilities. The strongest (ISTM) sort of analogy between a current system and a future lethal system would be that they use an overlapping set of generators of capabilities. Trying to find an agent that does a treacherous turn, for the same reasons as a f... (read more)

4Paul Christiano12d
The main way you produce a treacherous turn is not by "finding the treacherous turn capabilities," it's by creating situations in which sub-human systems have the same kind of motive to engage in a treacherous turn that we think future superhuman systems might have. There are some differences and lots of similarities between what is going on in a weaker AI doing a treacherous turn and a stronger AI doing a treacherous turn. So you expect to learn some things and not others. After studying several such cases it seems quite likely you understand enough to generalize to new cases. It's possible MIRI folks expect a bigger difference in how future AI is produced. I mostly expect just using gradient descent, resulting in minds that are in some ways different and in many ways different. My sense is that MIRI folks have a more mystical view about the difference between subhuman AI systems and "AGI." (The view "stack more layers won't ever give you true intelligence, there is a qualitative difference here" seems like it's taking a beating every year, whether it's Eliezer or Gary Marcus saying it.)

I'm asking what reification is, period, and what it has to do with what's in reality (the thing that bites you regardless of what you think).

1G Gordon Worley III22d
This seems straightforward to me: reification is a process by which our brain picks out patterns/features and encodes them so we can recognize them again and make sense of the world given our limited hardware. We can then think in terms of those patterns and gloss over the details because the details often aren't relevant for various things. The reason we reify things one way versus another depends on what we care about, i.e. our purposes. []

How do they explain why it feels like there are noumena? (Also by "feels like" I'd want to include empirical observations of nexusness.)

1G Gordon Worley III22d
To me this seems obvious: noumena feel real to most people because they're captured by their ontology. It takes a lot of work for a human mind to learn not to jump straight from sensation to reification, and even with training there's only so much a person can do because the mind has lots of low-level reification "built in" that happens prior to conscious awareness. Cf. noticing []

Things are reified out of sensory experience of the world (though note that "sensory" is redundant here), and the world is the unified non-thing

Okay, but the tabley-looking stuff out there seems to conform more parsimoniously to a theory that posits an external table. I assume we agree on that, and then the question is, what's happening when we so posit?

1G Gordon Worley III1mo
Yep, so I think this gets into a different question of epistemology not directly related to things but rather about what we care about, since positing a theory that what looks to me like a table implies something table shaped about the universe requires caring about parsimony. (Aside: It's kind of related because to talk about caring about things we need reifications that enable us to point to what we care about, but I think that's just an artifact of using words—care is patterns of behavior and preference we can reify call "parsimonious" or something else, but exist prior to being named.) If we care about something other than parsimony, we may not agree that the universe is filled with tables. Maybe we slice it up quite differently and tables exist orthogonal to our ontology.

if you define the central problem as something like building a system that you'd be happy for humanity to defer to forever.

[I at most skimmed the post, but] IMO this is a more ambitious goal than the IMO central problem. IMO the central problem (phrased with more assumptions than strictly necessary) is more like "building system that's gaining a bunch of understanding you don't already have, in whatever domains are necessary for achieving some impressive real-world task, without killing you". So I'd guess that's supposed to happen in step 1. It's debata... (read more)

2davidad (David A. Dalrymple)2mo
I’d say the scientific understanding happens in step 1, but I think that would be mostly consolidating science that’s already understood. (And some patching up potentially exploitable holes where AI can deduce that “if this is the best theory, the real dynamics must actually be like that instead”. But my intuition is that there aren’t many of these holes, and that unknown physics questions are mostly underdetermined by known data, at least for quite a long way toward the infinite-compute limit of Solomonoff induction, and possibly all the way.) Engineering understanding would happen in step 2, and I think engineering is more “the generator of large effects on the world,” the place where much-faster-than-human ingenuity is needed, rather than hoping to find new science. (Although the formalization of the model of scientific reality is important for the overall proposal—to facilitate validating that the engineering actually does what is desired—and building such a formalization would be hard for unaided humans.)

I speculate (based on personal glimpses, not based on any stable thing I can point to) that there's many small sets of people (say of size 2-4) who could greatly increase their total output given some preconditions, unknown to me, that unlock a sort of hivemind. Some of the preconditions include various kinds of trust, of common knowledge of shared goals, and of person-specific interface skill (like speaking each other's languages, common knowledge of tactics for resolving ambiguity, etc.).
[ETA: which, if true, would be good to have already set up before crunch time.]

I agree that the epistemic formulation is probably more broadly useful, e.g. for informed oversight. The decision theory problem is additionally compelling to me because of the apparent paradox of having a changing caring measure. I naively think of the caring measure as fixed, but this is apparently impossible because, well, you have to learn logical facts. (This leads to thoughts like "maybe EU maximization is just wrong; you don't maximize an approximation to your actual caring function".)

In case anyone shared my confusion:

The while loop where we ensure that eps is small enough so that

bound > bad1() + (next - this) * log((1 - p1) / (1 - p1 - eps))

is technically necessary to ensure that bad1() doesn't surpass bound, but it is immaterial in the limit. Solving

bound = bad1() + (next - this) * log((1 - p1) / (1 - p1 - eps))


eps >= (1/3) (1 - e^{ -[bound - bad1()] / [next - this]] })

which, using the log(1+x) = x approximation, is about

(1/3) ([bound - bad1()] / [next - this] ).

Then Scott's comment gives the rest. I was worried about the

... (read more)

Could you spell out the step

every iteration where mean(𝙴[𝚙𝚛𝚎𝚟:𝚝𝚑𝚒𝚜])≥2/5 will cause bound - bad1() to grow exponentially (by a factor of 11/10=1+(1/2)(−1+2/5𝚙𝟷))

a little more? I don't follow. (I think I follow the overall structure of the proof, and if I believed this step I would believe the proof.)

We have that eps is about (2/3)(1-exp([bad1() - bound]/(next-this))), or at least half that, but I don't see how to get a lower bound on the decrease of bad1() (as a fraction of bound-bad1() ).

1Scott Garrabrant7y
You are correct that you use the fact that 1+eps is at approximately e^(eps). The concrete way this is used in this proof is replacing the ln(1+3eps) you subtract from bad1 when the environment is a 1 with 3eps=(bound - bad1) / (next - this), and replacing the ln(1-3eps/2) you subtract from bad1 when the environment is a 0 with -3eps/2=-(bound - bad1) / (next - this)/2 Therefore, you subtract from bad1 approximately at least (next-this)((2/5)(bound - bad1) / (next - this)-(3/5)*(bound - bad1) / (next - this)/2). This comes out to (bound - bad1)/10. I believe the inequality is the wrong direction to just use e^(eps) as a bound for 1+eps, but when next-this gets big, the approximation gets close enough.