So for example, say Alice runs this experiment:
Train an agent A in an environment that contains the source B of A's reward.
Alice observes that A learns to hack B. Then she solves this as follows:
Same setup, but now B punishes (outputs high loss) A when A is close to hacking B, according to a dumb tree search that sees whether it would be easy, from the state of the environment, for A to touch B's internals.
Alice observes that A doesn't hack B. The Bob looks at Alice's results and says,
"Cool. But this won't generalize to future lethal systems because it doe...
The main way you produce a treacherous turn is not by "finding the treacherous turn capabilities," it's by creating situations in which sub-human systems have the same kind of motive to engage in a treacherous turn that we think future superhuman systems might have.
When you say "motive" here, is it fair to reexpress that as: "that which determines by what method and in which directions capabilities are deployed to push the world"? If you mean something like that, then my worry here is that motives are a kind of relation involving capabilities, not somet...
Creating in vitro examples of problems analogous to the ones that will ultimately kill us, e.g. by showing agents engaging in treacherous turns due to reward hacking or exhibiting more and more of the core features of deceptive alignment.
A central version of this seems to straightforwardly advance capabilities. The strongest (ISTM) sort of analogy between a current system and a future lethal system would be that they use an overlapping set of generators of capabilities. Trying to find an agent that does a treacherous turn, for the same reasons as a f...
I'm asking what reification is, period, and what it has to do with what's in reality (the thing that bites you regardless of what you think).
How do they explain why it feels like there are noumena? (Also by "feels like" I'd want to include empirical observations of nexusness.)
Things are reified out of sensory experience of the world (though note that "sensory" is redundant here), and the world is the unified non-thing
Okay, but the tabley-looking stuff out there seems to conform more parsimoniously to a theory that posits an external table. I assume we agree on that, and then the question is, what's happening when we so posit?
if you define the central problem as something like building a system that you'd be happy for humanity to defer to forever.
[I at most skimmed the post, but] IMO this is a more ambitious goal than the IMO central problem. IMO the central problem (phrased with more assumptions than strictly necessary) is more like "building system that's gaining a bunch of understanding you don't already have, in whatever domains are necessary for achieving some impressive real-world task, without killing you". So I'd guess that's supposed to happen in step 1. It's debata...
I speculate (based on personal glimpses, not based on any stable thing I can point to) that there's many small sets of people (say of size 2-4) who could greatly increase their total output given some preconditions, unknown to me, that unlock a sort of hivemind. Some of the preconditions include various kinds of trust, of common knowledge of shared goals, and of person-specific interface skill (like speaking each other's languages, common knowledge of tactics for resolving ambiguity, etc.).
[ETA: which, if true, would be good to have already set up before crunch time.]
I agree that the epistemic formulation is probably more broadly useful, e.g. for informed oversight. The decision theory problem is additionally compelling to me because of the apparent paradox of having a changing caring measure. I naively think of the caring measure as fixed, but this is apparently impossible because, well, you have to learn logical facts. (This leads to thoughts like "maybe EU maximization is just wrong; you don't maximize an approximation to your actual caring function".)
In case anyone shared my confusion:
The while loop where we ensure that eps is small enough so that
bound > bad1() + (next - this) * log((1 - p1) / (1 - p1 - eps))
is technically necessary to ensure that bad1() doesn't surpass bound, but it is immaterial in the limit. Solving
bound = bad1() + (next - this) * log((1 - p1) / (1 - p1 - eps))
gives
eps >= (1/3) (1 - e^{ -[bound - bad1()] / [next - this]] })
which, using the log(1+x) = x approximation, is about
(1/3) ([bound - bad1()] / [next - this] ).
Then Scott's comment gives the rest. I was worried about the
...Could you spell out the step
every iteration where mean(𝙴[𝚙𝚛𝚎𝚟:𝚝𝚑𝚒𝚜])≥2/5 will cause bound - bad1() to grow exponentially (by a factor of 11/10=1+(1/2)(−1+2/5𝚙𝟷))
a little more? I don't follow. (I think I follow the overall structure of the proof, and if I believed this step I would believe the proof.)
We have that eps is about (2/3)(1-exp([bad1() - bound]/(next-this))), or at least half that, but I don't see how to get a lower bound on the decrease of bad1() (as a fraction of bound-bad1() ).
Why isn't the answer obviously "yes"? What would it look like for this not to be the case? (I'm generally somewhat skeptical of descriptions like "just faster" if the faster is like multiple orders of magnitude and sure seems to result from new ideas rather than just a bigger computer.)