I don't really like the block-universe thing in this context. Here "reversible" refers to a time-course that doesn't particularly have to be physical causality; it's whatever course of sequential determination is relevant. E.g., don't cut yourself off from acausal trades.
I think "reversible" definitely needs more explication, but until proven otherwise I think it should be taken on faith that the obvious intuition has something behind it.
Unfortunately, more context is needed.
An LLM solves a mathematical problem by introducing a novel definition which humans can interpret as a compelling and useful concept.
I mean, I could just write a python script that prints out a big list of definitions of the form
"A topological space where every subset with property P also has property Q"
and having P and Q be anything from a big list of properties of subsets of topological spaces. I'd guess some of these will be novel and useful. I'd guess LLMs + some scripting could already take advantage of some o...
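To make the point concrete, here is a minimal sketch of the kind of script meant above. The property list is a hypothetical stand-in chosen for illustration (a real run would use a much bigger list), and the names `SUBSET_PROPERTIES` and `candidate_definitions` are made up for this example.

```python
from itertools import permutations

# Hypothetical, deliberately short list of properties of subsets of a
# topological space; a real run would draw on a much bigger list.
SUBSET_PROPERTIES = ["open", "closed", "compact", "connected", "dense", "countable"]


def candidate_definitions(properties):
    """Yield one candidate definition per ordered pair (P, Q) with P != Q."""
    for p, q in permutations(properties, 2):
        yield f"A topological space where every subset that is {p} is also {q}"


# Materialize the list so it can be inspected or filtered afterwards.
definitions = list(candidate_definitions(SUBSET_PROPERTIES))

if __name__ == "__main__":
    for definition in definitions:
        print(definition)
```

With six properties this prints 30 candidates. Nothing in the script judges which of them are novel or useful; that filtering is the part that would still need a human or an LLM.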
What I mean by confrontation-worthy empathy is about that sort of phrase being usable. I mean, I'm not saying it's the best phrase, or a good phrase to start with, or whatever. I don't think inserting Knightian uncertainty is that helpful; the object-level stuff is usually the most important thing to be communicating.
This maybe isn't so related to what you're saying here, but I'd follow the policy of first making it common knowledge that you're reporting your inside views (which implies that you're not assuming that the other person would share those views...
Well, making it pass people's "specific" bar seems frustrating, as I mentioned in the post, but: understand stuff deeply--such that it can find new analogies / instances of the thing, reshape its idea of the thing when given propositions about the thing taken as constraints, draw out relevant implications of new evidence for the ideas.
Like, someone's going to show me an example of an LLM applying modus ponens, or making an analogy. And I'm not going to care, unless there's more context; what I'm interested in is [that phenomenon which I understand at most ...
I'm not really sure whether or not we disagree. I did put "3%-10% probability of AGI in the next 10-15ish years".
I think the following few years will change this estimate significantly either way.
Well, I hope that this is a one-time thing. I hope that if in a few years we're still around, people go "Damn! We maybe should have been putting a bit more juice into decades-long plans! And we should do so now, though a couple more years belatedly!", rather than going "This time for sure!" and continuing to not invest in the decades-long plans. My impression ...
I think the current wave is special, but that's a very far cry from being clearly on the ramp up to AGI.
Then the third part needs only to hook together the other two parts with its goals to become an actualizing agent.
Basically just this? It would be hooking a lot more parts together. What makes it seem wildfirey to me is
I'm skeptical that there would be any such small key to activate a large/deep mechanism. Can you give a plausibility argument for why there would be?
Not really, because I don't think it's that likely to exist. There are other routes much more likely to work though. There's a bit of plausibility to me, mainly because of the existence of hormones, and generally the existence of genomic regulatory networks.
Why wouldn't we have evolved to have the key trigger naturally sometimes?
We do; they're active in childhood. I think.
That seems like a real thing, though I don't know exactly what it is. I don't think it's either unboundedly general or unboundedly ambitious, though. (To be clear, this isn't very strongly a critique of anyone; general optimization is really hard, because it's asking you to explore a very rich space of channels, and acting with unbounded ambition is very fraught because of unilateralism and seeing like a state and creating conflict and so on.) Another example is: how many people have made a deep and empathetic exploration of why [people doing work that ...
I don't think so, not usually. What happens after they join the EA club? My observations are more consistent with people optimizing (or sometimes performing to appear as though they're optimizing) through a fairly narrow set of channels. I mean, humans are in a weird liminal state, where we're just smart enough to have some vague idea that we ought to be able to learn to think better, but not smart and focused enough to get very far with learning to think better. More obviously, there's anti-interest in biological intelligence enhancement, rather than interest.
Good point, though I think it's a non-fallacious enthymeme. Like, we're talking about a car that moves around under its own power, but somehow doesn't have parts that receive, store, transform, and release energy and could be removed? Could be. The mind could be an obscure mess where nothing is factored, so that a cancerous newcomer with read-write access can't get any work out of the mind other than through the top-level interface. I think that explicitness (https://www.lesswrong.com/posts/KuKaQEu7JjBNzcoj5/explicitness) is a very strong general tendency ...
I feel like none of these historical precedents is a perfect match. It might be valuable to think about the ways in which they are similar and different.
To me a central difference, suggested by the word "strategic", is that the goal pursuit should be
By unboundedly ambitious I mean "has an unbounded ambit" (ambit = "the area went about in; the realm of wandering" https://en.wiktionary.org/wiki/ambit#Etymology ), i.e. its goals induce it to pursue unboundedly much control over the world.
By unboundedly gen...
Yes, I think there's stuff that humans do that's crucial for what makes us smart, that we have to do in order to perform some language tasks, and that the LLM doesn't do when you ask it to do those tasks, even when it performs well in the local-behavior sense.
If a mind comes to understand a bunch of stuff, there's probably some compact reasons that it came to understand a bunch of stuff. What could such reasons be? The mind might copy a bunch of understanding from other minds. But if the mind becomes much more capable than surrounding minds, that's not the reason, assuming that much greater capabilities required much more understanding. So it's some other reason. I'm describing this situation as the mind being on a trajectory of creativity.
(Sorry, I didn't get this on two readings. I may or may not try again. Some places I got stuck:
Are you saying that by pretending really hard to be made of entirely harmless elements (despite actually behaving with large and hence possibly harmful effects), an AI is also therefore in effect trying to prevent all out-of-band effects of its components / mesa-optimizers / subagents / whatever? This still has the basic alignment problem: I don't know how to make the AI be very intently trying to X, including where X = pretending really hard that whatever.
Or are...
That was one of the examples I had in mind with this post, yeah. (More precisely, I had in mind defenses of HCH being aligned that I heard from people who aren't Paul. I couldn't pass Paul's ITT about HCH or similar.)
Yeah, I think that roughly lines up with my example of "generator of large effects". The reason I'd say "generator of large effects" rather than "trying" is that "large effects" sounds slightly more like something that ought to have a sort of conservation law, compared to "trying". But both our examples are incomplete in that the supposed conservation law (which provides the inquisitive force of "where exactly does your proposal deal with X, which it must deal with somewhere by conservation") isn't made clear.
I don't recall seeing that theory in the first quarter of the book, but I'll look for it later. I somewhat agree with your description of the difference between the theories (at least, as I imagine a predictive processing flavored version). Except, the theories are more similar than you say, in that FIAT would also allow very partial coherentifying, so that it doesn't have to be "follow these goals, but allow these overrides", but can rather be, "make these corrections towards coherence; fill in the free parameters with FIAT goals; leave all the other inco...
Thanks. Your comments make sense to me, I think. But these essays are more like research notes than attempts at good exposition, so I'm not necessarily trying to make them consistently accessible. I'll add a note to that effect in future.
Yeah, that could produce an example of Doppelgängers. E.g. if an autist (in your theory) later starts using that machinery more heavily. Then there's the models coming from the general-purpose analysis, and the models coming from the intuitive machinery, and they're about the same thing.
An interesting question I don't know the answer to is if you get more cognitive empathy past the end of where human psychological development seems to stop.
Why isn't the answer obviously "yes"? What would it look like for this not to be the case? (I'm generally somewhat skeptical of descriptions like "just faster" if the faster is like multiple orders of magnitude and sure seems to result from new ideas rather than just a bigger computer.)
So for example, say Alice runs this experiment:
Train an agent A in an environment that contains the source B of A's reward.
Alice observes that A learns to hack B. Then she solves this as follows:
Same setup, but now B punishes (outputs high loss) A when A is close to hacking B, according to a dumb tree search that sees whether it would be easy, from the state of the environment, for A to touch B's internals.
Alice observes that A doesn't hack B. Then Bob looks at Alice's results and says,
"Cool. But this won't generalize to future lethal systems because it doe...
The main way you produce a treacherous turn is not by "finding the treacherous turn capabilities," it's by creating situations in which sub-human systems have the same kind of motive to engage in a treacherous turn that we think future superhuman systems might have.
When you say "motive" here, is it fair to reexpress that as: "that which determines by what method and in which directions capabilities are deployed to push the world"? If you mean something like that, then my worry here is that motives are a kind of relation involving capabilities, not somet...
Creating in vitro examples of problems analogous to the ones that will ultimately kill us, e.g. by showing agents engaging in treacherous turns due to reward hacking or exhibiting more and more of the core features of deceptive alignment.
A central version of this seems to straightforwardly advance capabilities. The strongest (ISTM) sort of analogy between a current system and a future lethal system would be that they use an overlapping set of generators of capabilities. Trying to find an agent that does a treacherous turn, for the same reasons as a f...
I'm asking what reification is, period, and what it has to do with what's in reality (the thing that bites you regardless of what you think).
How do they explain why it feels like there are noumena? (Also by "feels like" I'd want to include empirical observations of nexusness.)
Things are reified out of sensory experience of the world (though note that "sensory" is redundant here), and the world is the unified non-thing
Okay, but the tabley-looking stuff out there seems to conform more parsimoniously to a theory that posits an external table. I assume we agree on that, and then the question is, what's happening when we so posit?
if you define the central problem as something like building a system that you'd be happy for humanity to defer to forever.
[I at most skimmed the post, but] IMO this is a more ambitious goal than the IMO central problem. IMO the central problem (phrased with more assumptions than strictly necessary) is more like "building a system that's gaining a bunch of understanding you don't already have, in whatever domains are necessary for achieving some impressive real-world task, without killing you". So I'd guess that's supposed to happen in step 1. It's debata...
I speculate (based on personal glimpses, not based on any stable thing I can point to) that there's many small sets of people (say of size 2-4) who could greatly increase their total output given some preconditions, unknown to me, that unlock a sort of hivemind. Some of the preconditions include various kinds of trust, of common knowledge of shared goals, and of person-specific interface skill (like speaking each other's languages, common knowledge of tactics for resolving ambiguity, etc.).
[ETA: which, if true, would be good to have already set up before crunch time.]
I agree that the epistemic formulation is probably more broadly useful, e.g. for informed oversight. The decision theory problem is additionally compelling to me because of the apparent paradox of having a changing caring measure. I naively think of the caring measure as fixed, but this is apparently impossible because, well, you have to learn logical facts. (This leads to thoughts like "maybe EU maximization is just wrong; you don't maximize an approximation to your actual caring function".)
In case anyone shared my confusion:
The while loop where we ensure that eps is small enough so that
bound > bad1() + (next - this) * log((1 - p1) / (1 - p1 - eps))
is technically necessary to ensure that bad1() doesn't surpass bound, but it is immaterial in the limit. Solving
bound = bad1() + (next - this) * log((1 - p1) / (1 - p1 - eps))
gives
eps = (1 - p1) (1 - e^{ -[bound - bad1()] / [next - this] })
which, using the log(1+x) ≈ x approximation, is about
(1 - p1) ([bound - bad1()] / [next - this]).
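As a sanity check on the algebra, here is a small numerical sketch. It solves the equality for eps with p1 left as a parameter, plugs the result back in, and compares it with the first-order approximation. All numbers are hypothetical, and `span` stands in for `next - this` (renamed only because `next` is a Python builtin); none of this is taken from the original code.

```python
import math


def eps_exact(bound, bad1, span, p1):
    """Solve bound = bad1 + span * log((1 - p1) / (1 - p1 - eps)) for eps."""
    return (1 - p1) * (1 - math.exp(-(bound - bad1) / span))


def eps_linear(bound, bad1, span, p1):
    """First-order approximation, good when (bound - bad1) / span is small."""
    return (1 - p1) * (bound - bad1) / span


def rhs(bad1, span, p1, eps):
    """Right-hand side of the bound check from the while loop."""
    return bad1 + span * math.log((1 - p1) / (1 - p1 - eps))


# Hypothetical values chosen only for illustration.
bound, bad1, span, p1 = 10.0, 9.0, 100.0, 1 / 3

exact = eps_exact(bound, bad1, span, p1)
approx = eps_linear(bound, bad1, span, p1)
recovered_bound = rhs(bad1, span, p1, exact)

if __name__ == "__main__":
    print(exact, approx, recovered_bound)
```

Any eps strictly below `exact` satisfies the strict inequality in the while loop, and `exact` is always slightly below `approx` because 1 - e^{-x} < x for x > 0.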
Then Scott's comment gives the rest. I was worried about the
...Could you spell out the step
every iteration where mean(E[prev:this]) ≥ 2/5 will cause bound - bad1() to grow exponentially (by a factor of 11/10 = 1 + (1/2)(−1 + (2/5)/p1))
a little more? I don't follow. (I think I follow the overall structure of the proof, and if I believed this step I would believe the proof.)
We have that eps is about (2/3)(1-exp([bad1() - bound]/(next-this))), or at least half that, but I don't see how to get a lower bound on the decrease of bad1() (as a fraction of bound-bad1() ).
IME a lot of people's stated reasons for thinking AGI is near involve mistaken reasoning and those mistakes can be discussed without revealing capabilities ideas: https://www.lesswrong.com/posts/sTDfraZab47KiRMmT/views-on-when-agi-comes-and-on-strategy-to-reduce