Agent-foundations researcher. Working on Synthesizing Standalone World-Models, aiming at a technical solution to the AGI risk fit for worlds where alignment is punishingly hard and we only get one try.
Currently looking for additional funders ($1k+, details). Consider reaching out if you're interested, or donating directly.
Or get me to pay you money ($5-$100) by spotting holes in my agenda or providing other useful information.
Uhh, well, technically I wrote that sentence as a conditional, and technically I didn’t say whether or not the condition applied to you-in-particular.
I'll take "Steven Byrnes doesn't consider it necessary to immediately write a top-level post titled 'Synthesizing Standalone World-Models has an unsolved technical alignment problem'".
The idea that there's a simple state in the future, that still pins down the entire past, seems possible but weird
Laws of physics under the Standard Model are reversible though, aren't they? I think you can't do it from within an Everett branch, because some information ends up in inaccessible-to-you parts of the universal wavefunction, but if you had access to the wavefunction itself, you would've been able to run it in reverse. So under the Standard Model, future states do pin down the entire past.
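A toy illustration of the "reversible dynamics means a later state pins down the past" point. This is only a sketch: the update rule (Arnold's cat map on a small integer grid) is a stand-in I picked because it is deterministic and bijective, not a model of the Standard Model, and `N`, `step`, `unstep`, and `run` are hypothetical names.

```python
N = 101  # grid size; an arbitrary choice for this toy


def step(state):
    """One forward tick of the cat map (x, y) -> (2x + y, x + y) mod N."""
    x, y = state
    return ((2 * x + y) % N, (x + y) % N)


def unstep(state):
    """Exact inverse of `step`: (x, y) -> (x - y, 2y - x) mod N."""
    x, y = state
    return ((x - y) % N, (2 * y - x) % N)


def run(state, n, rule):
    """Apply `rule` to `state` n times."""
    for _ in range(n):
        state = rule(state)
    return state


initial = (3, 58)
future = run(initial, 1000, step)       # roll the "laws" forward
recovered = run(future, 1000, unstep)   # run them in reverse from the future state
print(recovered == initial)
```

Because every step is a bijection, the full state at any one time determines the state at every other time. The Everett-branch caveat corresponds to only seeing part of the state, in which case `unstep` could not be applied.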
One thing that's confusing to me: Why K-complexity of the low-level history?
Hm; frankly, simply because it's the default I ran with.
Why not, for example, Algorithmic Minimal Sufficient Statistic, which doesn't count the uniform noise?
That seems like an acceptable fit. It's defined through Kolmogorov complexity anyway, though; would it produce any qualitatively different conclusions here?
I think I prefer frequentist justifications for complexity priors, because they explain why it works even on small parts of the universe
Interesting. Elaborate?
The laws of physics at the lowest level + initial conditions are sufficient to roll out the whole history, so (in K-complexity) there's no benefit to adding descriptions of the higher levels.
Unless high-level structure lets you compress the initial conditions themselves, no?
Self-arguing on the topic
Counter-argument: The initial state had no structure we could exploit for compression, pure chaos.
Counter²-argument: Any given history that ended up well-abstracting corresponds to a specific inhomogeneous distribution of mass in the early universe, which defined the way the galaxies are spread across it. At least that seems to be the step that could already be compressed. If there were a step upstream of it where the state really didn't have any structure, that unstructured state could be generated by describing the post-structure-formation state, describing the "timestamp" of the post-structure-formation state, then running physics in reverse to generate the unstructured state. So unless the later structured state fails to be lower-description-length than the earlier unstructured state, structure/abstractibility should still allow you to compress the initial state's description, even if the structure only appears later.
Counter³-argument: The real "initial state" is the initial state of the quantum multiverse from which all possible Everett branches (and so all possible inhomogeneous distributions of mass, etc.) are generated. Its description length could be incredibly low, such as a uniform point singularity with no expensive-to-describe inhomogeneities whatsoever. The bits you later have to spend to describe the state of your universe are effectively spent on pinpointing the specific Everett branch you're in, but the actual algorithm generating the whole Tegmark III multiverse did not have to do that. It just described the simple state from which all possible branches descend.
Counter⁴-argument: My understanding is that under QM/QFT, the universe doesn't start from a singularity; the singularity is a general-relativity thing. QM/QFT require an initial inhomogeneous universal wavefunction to start working.
Counter⁵-argument: Perhaps the real Theory of Everything unifying QFT and GR would have an initial homogeneous singularity from which all possible Everett branches are generated, and this end result seems plausible enough that we may as well assume it right now.
I don't know enough fundamental physics to make a confident call here. Though...
Counter⁶-argument: There seems to be some process which reallocates realityfluid within Tegmark III as well, between Everett branches. I think this is a hint that the "Tegmark III entire is a single program for the purposes of anthropics/from Tegmark IV's point of view" idea is somehow wrong.
Wait, none of that actually helps; you're right. If we can specify the full state of the universe/multiverse at any one moment, the rest of its history can be generated from that moment. To do so most efficiently, we should pick the simplest-to-describe state, and there we would benefit from having some structure. But as long as we have one simple-to-describe state, we can have all the other states be arbitrarily unstructured, with no loss of simplicity. So what we should expect is a history with at least one moment of structure (e. g., the initial conditions) that can then immediately dissolve into chaos.
To impose structure on the entire history, we do have to introduce some source of randomness that interferes in the state-transition process, making it impossible to deterministically compute later states from early ones. I. e., the laws of physics themselves have to be "incomplete"/stochastic, such that they can't be used as the decompression algorithm. I do have some thoughts on why that may (effectively) be the case, but they're on a line of reasoning I don't really trust.
... Alternatively, what if the most compact description of the lowest-level state at any given moment routes through describing the entire multi-level history? I. e., what if even abstractions that exist in the distant future shed some light at the present lowest-level state, and they do so in a way that's cheaper than specifying the lowest-level state manually?
Suppose the state is parametrized by real numbers. As it evolves, ever-more-distant decimal digits become relevant. This means that, if you want to simulate this universe on a non-analog computer (i. e., a computer that doesn't use unlimited-precision reals) from t = 0 to t = n starting from some initial state S_0, with the simulation error never exceeding some value, the precision with which you have to specify S_0 scales with n. Indeed, as n goes to infinity, so does the needed precision (i. e., the description length).
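To make the "needed precision grows with the length of the simulated history" claim concrete, here is a minimal sketch using the doubling map x -> 2x mod 1 as a stand-in chaotic system (my choice, not something from the discussion above). Each step shifts the binary expansion of the state by one digit, so digits that start out far past the binary point eventually become leading digits. All names (`truncate`, `max_error`, `bits_needed`) and the sample values are hypothetical.

```python
from fractions import Fraction
from math import floor


def truncate(x, bits):
    """Keep only the first `bits` binary digits of x (for x in [0, 1))."""
    scale = 2 ** bits
    return Fraction(floor(x * scale), scale)


def max_error(x0, bits, n):
    """Worst gap over steps 0..n between the exact trajectory and the one
    started from x0 truncated to `bits` binary digits."""
    x, a = x0, truncate(x0, bits)
    worst = abs(x - a)
    for _ in range(n):
        x, a = (2 * x) % 1, (2 * a) % 1  # doubling map, exact arithmetic
        worst = max(worst, abs(x - a))
    return worst


def bits_needed(x0, n, tol):
    """Smallest number of initial-state bits keeping the error <= tol for n steps."""
    bits = 0
    while max_error(x0, bits, n) > tol:
        bits += 1
    return bits


x0 = Fraction(1, 3)     # binary expansion 0.010101...
tol = Fraction(1, 16)   # allowed simulation error
for n in (5, 10, 20, 40):
    print(n, bits_needed(x0, n, tol))
```

For this particular map the required number of bits grows roughly as n plus a constant set by the tolerance. Other chaotic systems would scale differently, but the qualitative point (longer histories need longer initial-state descriptions) is the one being illustrated.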
Given all that, is it plausible that far-future abstractions summarize redundant information stored in the current state? Such that specifying the lowest-level state up to the needed precision is cheaper by describing the future history, rather than by manually specifying the position of every particle (or, rather, the finer details of the universal wavefunction).
... Yes, I think? Like, consider the state S_n, with some high-level system A existing in it. Suppose we want to infer S_0 from S_n. How much information does A tell us about S_0? Intuitively, quite a lot: for A to end up arising, many fine details in the distant past had to line up just right. Thus, knowing about A likely gives us more bits about the exact low-level past state than the description length of A itself.
Ever-further-in-the-future high-level abstractions essentially serve as compressed information about sets of ever-more-distant decimal-expansion digits of past lowest-level states. As long as an abstraction takes fewer bits to specify than the bits it communicates about the initial conditions, its presence decreases that initial state's description length.
This is basically just the scaled-up version of counter²-argument from the collapsible. If an unstructured state deterministically evolves into a structured state, those future structures are implicit in its at-a-glance-unstructured form. Thus, the more simple-to-describe high-level structures a state produces across its history, the simpler it itself is to describe. So if we want to run a universe from t = 0 to t = n with a bounded simulation error, the simplest initial conditions would impose the well-abstractibility property on the whole 0-to-n interval. That recovers the property I want.
Main diff with your initial argument: the idea that the description length of the lowest-level state at any given moment effectively scales with the length of history you want to model, rather than being constant-and-finite. This makes it a question of whether any given additional period of future history is cheaper to specify by directly describing the desired future multi-level abstract state, or by packing that information into the initial conditions; and the former seems cheaper.
All that reasoning is pretty raw, obviously. Any obvious errors there?
Also, this is pretty useful. For bounty purposes, I'm currently feeling $20 on this one; feel free to send your preferred payment method via PMs.
There are effectively infinitely many things about the world that one could figure out
One way to control that is to control the training data. We don't necessarily have to point the wm-synthesizer at the Pile indiscriminately,[1] we could assemble a dataset about a specific phenomenon we want to comprehend.
if we’re talking about e.g. possible inventions that don’t exist yet, then the combinatorial explosion of possibilities gets even worse
Human world-models are lazy: they store knowledge in the maximally "decomposed" form[2], and only synthesize specific concrete concepts when they're needed. (E. g., "a triangular lightbulb", which we could easily generate – which our world-models effectively "contain" – but which isn't generated until needed.)
I expect inventions are the same thing. Given a powerful-enough world-model, we should be able to produce what we want just by using the world-model's native functions for that. Pick the needed concepts, plug them into each other in the right way, hit "run".
If constructing the concepts we want requires agency, the one contributing it could be the human operator, if they understand how the world-model works well enough.
Will e-mail regarding the rest.
It’s funny that I’m always begging people to stop trying to reverse-engineer the neocortex, and you’re working on something that (if successful) would end up somewhere pretty similar to that, IIUC
The irony is not lost on me. When I was reading your Foom & Doom posts, and got to this section, I did have a reaction roughly along those lines.
(But hmm, I guess if a paranoid doom-pilled person was trying to reverse-engineer the neocortex and only publish the results if they thought it would help with safe & beneficial AGI, and if they in fact had good judgment on that question, then I guess I’d be grudgingly OK with that.)
I genuinely appreciate the sanity-check and the vote of confidence here!
Indeed, we might want to actively avoid that.
Perhaps something along the lines of the constructive-PID thing I sketched out.
Sooo, apparently OpenAI's mysterious breakthrough technique for generalizing RL to hard-to-verify domains that scored them IMO gold is just... "use the LLM as a judge"? Sources: the main one is paywalled, but this seems to capture the main data, and you can also search for various crumbs here and here.
The technical details of how exactly the universal verifier works aren’t yet clear. Essentially, it involves tasking an LLM with the job of checking and grading another model’s answers by using various sources to research them.
My understanding is that they approximate an oracle verifier by an LLM with more compute and access to more information and tools, then train the model to be accurate by this approximate-oracle's lights.
Now, it's possible that the journalists are completely misinterpreting the thing they're reporting on, or that it's all some galaxy-brained OpenAI op to mislead the competition. It's also possible that there's some incredibly clever trick for making it work much better than how it sounds like it'd work.
But if that's indeed the accurate description of the underlying reality, that's... kind of underwhelming. I'm curious how far this can scale, but I'm not feeling very threatened by it.
(Haven't seen this discussed on LW, kudos to @lwreader132 for bringing it to my attention.)
It seems to me that many disagreements regarding whether the world can be made robust against a superintelligent attack (e. g., the recent exchange here) are downstream of different people taking on a mathematician's vs. a hacker's mindset.
A mathematician might try to transform a program up into successively more abstract representations to eventually show it is trivially correct; a hacker would prefer to compile a program down into its most concrete representation to brute force all execution paths & find an exploit trivially proving it incorrect.
Imagine the world as a multi-level abstract structure, with different systems (biological cells, human minds, governments, cybersecurity systems, etc.) implemented on different abstraction layers.
The mindsets also then differ regarding what they expect ASI to be good at.
"Mathematicians" expect really sophisticated within-layer performance: really good technology, really good logistics, really good rhetoric, et cetera. This can still make an ASI really, really powerful, powerful enough to defeat all of humanity combined. But ultimately, in any given engagement, ASI plays "by the rules", in a certain abstract sense. Each of its tools can in-principle be defended-against on the terms of the abstraction layer at which they're deployed. All it would take is counter-deploying systems that are sufficiently theoretically robust, and doing so on all abstraction layers simultaneously. Very difficult, but ultimately doable, and definitely not hopeless.
"Hackers" expect really good generalized hacking. No amount of pre-superintelligent preparation is going to suffice against it, because any given tool we deploy, any given secure system we set up, would itself have implementation-level holes in it that the ASI's schemes would be able to worm through. It may at best delay the ASI for a little bit, but the attack surface is too high-dimensional, and the ASI is able to plot routes through that high-dimensional space which we can't quite wrap our head around.
As you might've surmised, I favour the hacker mindset here.
Now, arguably, any given plot to compromise an abstraction layer is itself deployed from within some other abstraction layer, so a competent mathematician's mindset shouldn't really be weaker than a hacker's. For example, secure software is made insecure by exploiting hardware vulnerabilities, and "defend against hardware vulnerabilities" is something a mathematician is perfectly able to understand and execute on. Same for securing against Basilisk hacks, logistical sabotage, etc.
But the mathematician is still, in some sense, "not getting it"; still centrally thinks in terms of within-layer attacks, rather than native cross-layer attacks.
One core thing here is that a cross-layer attack doesn't necessarily look like a meaningful attack within the context of any one layer. For example, there's apparently an exploit where you modulate the RPM of a hard drive in order to exfiltrate data from an airgapped server using a microphone. By itself, placing a microphone next to an airgapped server isn't a "hardware attack" in any meaningful sense (especially if it doesn't have dedicated audio outputs), and some fiddling with a hard drive's RPM isn't a "software attack" either. Taken separately, within each layer, both just look like random actions. You therefore can't really discover (and secure against) this type of attack if, in any given instance, you reason in terms of a single abstraction layer.
So I think a hacker's mindset is the more correct way to look at the problem.
And, looking at things from within a hacker's mindset, I think it's near straight-up impossible for a non-superintelligence to build any nontrivially complicated system that would be secure against a superintelligent attack.
Like... Humanity vs. ASI is sometimes analogized to a chess battle, with one side arguing that Stockfish is guaranteed to beat any human, even if you don't know the exact sequence of moves it will play, and the other side joking that the human can just flip the board.
But, uh. In this metaphor, the one coming up with the idea to flip the board[1], instead of playing by the rules, would be the ASI, not the human.
Or, perhaps, to execute a pattern of chess-piece moves which, as the human reasons about them, push them onto trains of thought that ultimately trigger a trauma response in the human, causing them to resign.
I agree that it's a promising direction.
I did actually try a bit of that back in the o1 days. What I've found is that getting LLMs to output formal Lean proofs is pretty difficult: they really don't want to do that. When they're not making mistakes, they use informal language as connective tissue between Lean snippets, they put in "sorry"s (a placeholder that makes a lemma evaluate as proven), and otherwise try to weasel out of it.
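For readers who haven't used Lean, a minimal Lean 4 sketch of what the `sorry` escape hatch looks like (the theorem names are made up for illustration). As far as I know, Lean accepts the first declaration but flags it with a "declaration uses 'sorry'" warning, so the placeholder is detectable if you check for it.

```lean
-- Accepted by Lean (with a warning), but nothing has actually been proven.
theorem my_add_zero (n : Nat) : n + 0 = n := by
  sorry

-- An actual proof of the same statement, using a core library lemma.
theorem my_add_zero' (n : Nat) : n + 0 = n :=
  Nat.add_zero n
```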
This is something that should be solvable by fine-tuning, but at the time, there weren't any publicly available decent models fine-tuned for that.
We do have DeepSeek-Prover-V2 now, though. I should look into it at some point. But I am not optimistic: it sounds like it's doing the same stuff, just more cleverly.
Relevant: Terence Tao does find them helpful for some Lean-related applications.
(Disclaimer: only partially relevant rant.)
Outside of [coding], I don't know of it being more than a somewhat better google
I've recently tried heavily leveraging o3 as part of a math-research loop.
I have never been more bearish on LLMs automating any kind of research than I am now.
And I've tried lots of ways to make it work. I've tried telling it to solve the problem without any further directions, I've tried telling it to analyze the problem instead of attempting to solve it, I've tried dumping my own analysis of the problem into its context window, I've tried getting it to search for relevant lemmas/proofs in math literature instead of attempting to solve it, I've tried picking out a subproblem and telling it to focus on that, I've tried giving it directions/proof sketches, I've tried various power-user system prompts, I've tried resampling the output thrice and picking the best one. None of this made it particularly helpful, and the bulk of the time was spent trying to spot where it was lying to me or confabulating in its arguments or proofs (which it ~always did).
It was kind of okay for tasks like "here's a toy setup, use a well-known formula to compute the relationships between A and B", or "try to rearrange this expression into a specific form using well-known identities", which are relatively menial and freed up my working memory for more complicated tasks. But it's pretty minor usefulness (and you have to re-check the outputs for errors anyway).
I assume there are math problems at which they do okay, but that capability sure is brittle. I don't want to overupdate here, but geez, getting LLMs from here to the Singularity in 2-3 years just doesn't feel plausible.
Fully agree with everything in this post, this is exactly my model as well. (That's the reason behind my last-line rug-pull here, by the way.)
Some new data on that point:
To summarize what the paper argues (from my post in that thread):
I. e., it is effectively the case that there's (pseudo)randomness injected into the state-transition process.
And if a given state has some other regularities by which it could be compactly defined, aside from defining it through the initial conditions, that would indeed decrease its description length/algorithmic entropy. So we again recover the "trajectories that abstract well throughout their entire history are simpler" claim.