AI ALIGNMENT FORUM
Petrov Day
AF

66
Thane Ruthenis
Ω852241101
Message
Dialogue
Subscribe

Agent-foundations researcher. Working on Synthesizing Standalone World-Models, aiming at a technical solution to the AGI risk fit for worlds where alignment is punishingly hard and we only get one try.

Currently looking for additional funders ($1k+, details). Consider reaching out if you're interested, or donating directly.

Or get me to pay you money ($5-$100) by spotting holes in my agenda or providing other useful information.

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
5Thane Ruthenis's Shortform
1y
34
Synthesizing Standalone World-Models (+ Bounties, Seeking Funding)
Thane Ruthenis3d30

There are effectively infinitely many things about the world that one could figure out

One way to control that is to control the training data. We don't necessarily have to point the wm-synthesizer at the Pile indiscriminately,[1] we could assemble a dataset about a specific phenomenon we want to comprehend.

if we’re talking about e.g. possible inventions that don’t exist yet, then the combinatorial explosion of possibilities gets even worse

Human world-models are lazy: they store knowledge in the maximally "decomposed" form[2], and only synthesize specific concrete concepts when they're needed. (E. g., "a triangular lightbulb", which we could easily generate – which our world-models effectively "contain" – but which isn't generated until needed.)

I expect inventions are the same thing. Given a powerful-enough world-model, we should be able to produce what we want just by using the world-model's native functions for that. Pick the needed concepts, plug them into each other in the right way, hit "run".

If constructing the concepts we want requires agency, the one contributing it could be the human operator, if they understand how the world-model works well enough.

Will e-mail regarding the rest.

It’s funny that I’m always begging people to stop trying to reverse-engineer the neocortex, and you’re working on something that (if successful) would end up somewhere pretty similar to that, IIUC

The irony is not lost on me. When I was reading your Foom & Doom posts, and got to this section, I did have a reaction roughly along those lines.

(But hmm, I guess if a paranoid doom-pilled person was trying to reverse-engineer the neocortex and only publish the results if they thought it would help with safe & beneficial AGI, and if they in fact had good judgment on that question, then I guess I’d be grudgingly OK with that.)

I genuinely appreciate the sanity-check and the vote of confidence here!

  1. ^

    Indeed, we might want to actively avoid that.

  2. ^

    Perhaps something along the lines of the constructive-PID thing I sketched out.

Reply
Thane Ruthenis's Shortform
Thane Ruthenis2mo43

Sooo, apparently OpenAI's mysterious breakthrough technique for generalizing RL to hard-to-verify domains that scored them IMO gold is just... "use the LLM as a judge"? Sources: the main one is paywalled, but this seems to capture the main data, and you can also search for various crumbs here and here.

The technical details of how exactly the universal verifier works aren’t yet clear. Essentially, it involves tasking an LLM with the job of checking and grading another model’s answers by using various sources to research them.

My understanding is that they approximate an oracle verifier by an LLM with more compute and access to more information and tools, then train the model to be accurate by this approximate-oracle's lights.

Now, it's possible that the journalists are completely misinterpreting the thing they're reporting on, or that it's all some galaxy-brained OpenAI op to mislead the competition. It's also possible that there's some incredibly clever trick for making it work much better than how it sounds like it'd work.

But if that's indeed the accurate description of the underlying reality, that's... kind of underwhelming. I'm curious how far this can scale, but I'm not feeling very threatened by it.

(Haven't seen this discussed on LW, kudos to @lwreader132 for bringing it to my attention.)

Reply
Thane Ruthenis's Shortform
Thane Ruthenis3mo*168

It seems to me that many disagreements regarding whether the world can be made robust against a superintelligent attack (e. g., the recent exchange here) are downstream of different people taking on a mathematician's vs. a hacker's mindset.

Quoting Gwern:

A mathematician might try to transform a program up into successively more abstract representations to eventually show it is trivially correct; a hacker would prefer to compile a program down into its most concrete representation to brute force all execution paths & find an exploit trivially proving it incorrect.

Imagine the world as a multi-level abstract structure, with different systems (biological cells, human minds, governments, cybersecurity systems, etc.) implemented on different abstraction layers. 

  • If you look at it through a mathematician's lens, you consider each abstraction layer approximately robust. Making things secure, then, is mostly about working within each abstraction layer, building systems that are secure under the assumptions of a given abstraction layer's validity. You write provably secure code, you educate people to resist psychological manipulations, you inoculate them against viral bioweapons, you implement robust security policies and high-quality governance systems, et cetera.
    • In this view, security is a phatic problem, an once-and-done thing.
    • In warfare terms, it's a paradigm in which sufficiently advanced static fortifications rule the day, and the bar for "sufficiently advanced" is not that high.
  • If you look at it through a hacker's lens, you consider each abstraction layer inherently leaky. Making things secure, then, is mostly about discovering all the ways leaks could happen and patching them up. Worse yet, the tools you use to implement your patches are themselves leakily implemented. Proven-secure code is foiled by hardware vulnerabilities that cause programs to move to theoretically impossible states; the abstractions of human minds are circumvented by Basilisk hacks; the adversary intervenes on the logistical lines for your anti-bioweapon tools and sabotages them; robust security policies and governance systems are foiled by compromising the people implementing them rather than by clever rules-lawyering; and so on.
    • In this view, security is an anti-inductive problem, an ever-moving target.
    • In warfare terms, it's a paradigm that favors maneuver warfare, and static fortifications are just big dumb objects to walk around.

The mindsets also then differ regarding what they expect ASI to be good at.

"Mathematicians" expect really sophisticated within-layer performance: really good technology, really good logistics, really good rhetoric, et cetera. This can still make an ASI really, really powerful, powerful enough to defeat all of humanity combined. But ultimately, in any given engagement, ASI plays "by the rules", in a certain abstract sense. Each of its tools can in-principle be defended-against on the terms of the abstraction layer at which they're deployed. All it would take is counter-deploying systems that are sufficiently theoretically robust, and doing so on all abstraction layers simultaneously. Very difficult, but ultimately doable, and definitely not hopeless.

"Hackers" expect really good generalized hacking. No amount of pre-superintelligent preparation is going to suffice against it, because any given tool we deploy, any given secure system we set up, would itself have implementation-level holes in it that the ASI's schemes would be able to worm through. It may at best delay the ASI for a little bit, but the attack surface is too high-dimensional, and the ASI is able to plot routes through that high-dimensional space which we can't quite wrap our head around.

As you might've surmised, I favour the hacker mindset here.

Now, arguably, any given plot to compromise an abstraction layer is itself deployed from within some other abstraction layer, so a competent mathematician's mindset shouldn't really be weaker than a hacker's. For example, secure software is made insecure by exploiting hardware vulnerabilities, and "defend against hardware vulnerabilities" is something a mathematician is perfectly able to understand and execute on. Same for securing against Basilisk hacks, logistical sabotage, etc.

But the mathematician is still, in some sense, "not getting it"; still centrally thinks in terms of within-layer attacks, rather than native cross-layer attacks.

One core thing here is that a cross-layer attack doesn't necessarily look like a meaningful attack within the context of any one layer. For example, there's apparently an exploit where you modulate the RPM of a hard drive in order to exfiltrate data from an airgapped server using a microphone. By itself, placing a microphone next to an airgapped server isn't a "hardware attack" in any meaningful sense (especially if it doesn't have dedicated audio outputs), and some fiddling with a hard drive's RPM isn't a "software attack" either. Taken separately, within each layer, both just look like random actions. You therefore can't really discover (and secure against) this type of attack if, in any given instance, you reason in terms of a single abstraction layer.

So I think a hacker's mindset is the more correct way to look at the problem.

And, looking at things from within a hacker's mindset, I think it's near straight-up impossible for a non-superintelligence to build any nontrivially complicated system that would be secure against a superintelligent attack.

Like... Humanity vs. ASI is sometimes analogized to a chess battle, with one side arguing that Stockfish is guaranteed to beat any human, even if you don't know the exact sequence of moves it will play, and the other side joking that the human can just flip the board.

But, uh. In this metaphor, the one coming up with the idea to flip the board[1], instead of playing by the rules, would be the ASI, not the human.

  1. ^

    Or, perhaps, to execute a pattern of chess-piece moves which, as the human reasons about them, push them onto trains of thought that ultimately trigger a trauma response in the human, causing them to resign.

Reply5
johnswentworth's Shortform
Thane Ruthenis3mo10

I agree that it's a promising direction.

I did actually try a bit of that back in the o1 days. What I've found is that getting LLMs to output formal Lean proofs is pretty difficult: they really don't want to do that. When they're not making mistakes, they use informal language as connective tissue between Lean snippets, they put in "sorry"s (a placeholder that makes a lemma evaluate as proven), and otherwise try to weasel out of it.

This is something that should be solvable by fine-tuning, but at the time, there weren't any publicly available decent models fine-tuned for that.

We do have DeepSeek-Prover-V2 now, though. I should look into it at some point. But I am not optimistic, sounds like it's doing the same stuff, just more cleverly.

Relevant: Terence Tao does find them helpful for some Lean-related applications.

Reply
johnswentworth's Shortform
Thane Ruthenis3mo*116

(Disclaimer: only partially relevant rant.)

Outside of [coding], I don't know of it being more than a somewhat better google

I've recently tried heavily leveraging o3 as part of a math-research loop.

I have never been more bearish on LLMs automating any kind of research than I am now.

And I've tried lots of ways to make it work. I've tried telling it to solve the problem without any further directions, I've tried telling it to analyze the problem instead of attempting to solve it, I've tried dumping my own analysis of the problem into its context window, I've tried getting it to search for relevant lemmas/proofs in math literature instead of attempting to solve it, I've tried picking out a subproblem and telling it to focus on that, I've tried giving it directions/proof sketches, I've tried various power-user system prompts, I've tried resampling the output thrice and picking the best one. None of this made it particularly helpful, and the bulk of the time was spent trying to spot where it's lying or confabulating to me in its arguments or proofs (which it ~always did).

It was kind of okay for tasks like "here's a toy setup, use a well-known formula to compute the relationships between A and B", or "try to rearrange this expression into a specific form using well-known identities", which are relatively menial and freed up my working memory for more complicated tasks. But it's pretty minor usefulness (and you have to re-check the outputs for errors anyway).

I assume there are math problems at which they do okay, but that capability sure is brittle. I don't want to overupdate here, but geez, getting LLMs from here to the Singularity in 2-3 years just doesn't feel plausible.

Reply1
Foom & Doom 1: “Brain in a box in a basement”
Thane Ruthenis3mo65

Fully agree with everything in this post, this is exactly my model as well. (That's the reason behind my last-line rug-pull here, by the way.)

Reply
Acausal normalcy
Thane Ruthenis3mo20

The way I'd phrase it[1] is that the set of all acausal deals made by every civilization with every other civilization potentially has an abstract hierarchical structure, same way everything else does. Meaning there are commonly reoccurring low-level patterns and robust emergent high-level dynamics, and you can figure those out (and start following them) without actually explicitly running full-fidelity simulations of all these other civilizations. Doing so would then in-expectation yield you a fair percentage of the information you'd get from running said full-fidelity simulations.

This is similar to e. g. how we can use the abstractions of "government", "culture", "society" and "economy" to predict the behavior of humans on Earth, without running full-fidelity simulations of each individual person, and how this lets us mostly correctly predict the rough shape of all of their behaviors.

I think it's on-its-face plausible that the acausal "society" is the same. There are some reasons to think there are convergently reoccurring dynamics (see the boundaries discussion), the space of acausal deals/Tegmark IV probably has a sort of "landscape"/high-level order to it, etc.

(Another frame: instead of running individual full-fidelity simulations of every individual civilization you're dealing with, you can run a coarse-grained/approximate simulation of the entirety of Tegmark IV, and then use just that to figure out roughly what sorts of deals you should be making.)

  1. ^

    Or maybe this is a completely different idea/misinterpretation of the post. I've read it years ago and only skimmed it now, I may be misremembering. Sorry if so.

Reply
Natural Latents: The Concepts
Thane Ruthenis4mo*21

Cool. I've had the same idea, that we want something like "synergistic information present in each random subset of the system's constituents", and yeah, it doesn't work out-of-the-box.

Some other issues there:

  • If we're actually sampling random individual atoms all around the dog's body, it seems to me that we'd need an incredibly large amount of them to decode anything useful. Much fewer than if we were sampling random small connected chunks of atoms.
    • More intuitive example: Suppose we want to infer a book's topic. What's the smallest N such that we can likely infer the topic from a random string of length N? Comparatively, what's the smallest M such that we can infer it from M letters randomly and independently sampled from the book's text? It seems to me that  N≪M.
  • But introducing "chunks of nearby variables" requires figuring out what "nearby" is, i. e., defining some topology for the low-level representation. How does that work?
  • Further, the size of the chunk needed depends a lot on which part of the system we sample, so just going "a flat % of all constituents" doesn't work. Consider happening to land on a DNA string vs. some random part of the interior of the dog's stomach.
    • Actually, dogs are kind of a bad example, animals do have DNA signatures spread all around them. A complex robot, then. If we have a diverse variety of robots, inferring the specific type is easy if we sample e. g. part of the hardware implementing its long-term memory, but not if we sample a random part of an appendage.
    • Or a random passage from the book vs. the titles of the book's chapters. Or even just "a sample of a particularly info-dense paragraph" vs. "a sample from an unrelated anecdote from the author's life". % of the total letter count just doesn't seem like the right notion of "smallness".
  • On the flip side, sometimes it's reversed: sometimes we do want to sample random unconnected atoms. E. g., the nanomachine example: if we happen to sample the "chunk" corresponding to appendage#12, we risk learning nothing about the high-level state, whereas if we sample three random atoms from different parts of it, that might determine the high-level state uniquely. So now the desired topology of the samples is different: we want non-connected chunks.

I'm currently thinking this is solved by abstraction hierarchies. Like, maybe the basic definition of an abstraction is of the "redundant synergistic variable" type, and the lowest-level abstractions are defined over the lowest-level elements (molecules over atoms). But then higher-level abstractions are redundant-synergistic over lower-level abstractions (rather than actual lowest-level elements), and up it goes. The definitions of the lower-level abstractions provide the topology + sizing + symmetries, which higher-level abstractions then hook up to. (Note that this forces us to actually step through the levels, either bottom-up or top-down.)

As examples:

  • The states of the nanomachines' modules are inferable from any subset of the modules' constituent atoms, and the state of the nanomachine itself is inferable from the states of any subset of the modules. But there's no such neat relationships between atoms and the high-level state.
  • "A carbon atom" is synergistic information about a chunk of voxels (baking-in how that chunk could vary, e. g. rotations, spatial translations); "a DNA molecule" is synergistic information about a bunch of atoms (likewise defining custom symmetries under which atom-compositions still count as a DNA molecule); "skin tissue" is synergistic over molecules; and somewhere up there we have "a dog" synergistic over custom-defined animal-parts.

Or something vaguely like that; this doesn't exactly work either. I'll have more to say about this once I finish distilling my notes for external consumption instead of expanding them, which is going to happen any... day... now...

Reply
When is it important that open-weight models aren't released? My thoughts on the benefits and dangers of open-weight models in response to developments in CBRN capabilities.
Thane Ruthenis4mo10

I very tentatively agree with that.

I'd guess it's somewhat unlikely that large AI companies or governments would want to continue releasing models with open weights once they are this capable, though underestimating capabilities is possible

I think that's a real concern, though. I think the central route by which going open-source at the current capability level leads to extinction is a powerful AI model successfully sandbagging during internal evals (which seems pretty easy for an actually dangerous model to do, given evals' current state), getting open-sourced, and things then going the "rogue replication" route.

Reply
Natural Latents: The Concepts
Thane Ruthenis4mo*51

I'm possibly missing something basic here, but: how is the redund/latent-focused natural-abstraction theory supposed to deal with synergistic information (and "emergent" dynamics)?

Consider a dog at the level of atoms. It's not, actually, the case that "this is a dog" is redundantly encoded in each atom. Even if each atom were clearly labeled, and we had an explicit approximately deterministic  atom_configuration→animal function, the state of any individual atom would constrain the output not at all. Atom#2354 being in a state #7532 is consistent with its comprising either a dog, or a cat, or an elephant...

This only stops applying if we consider macroscopically sized chunks of atoms, or the specific set of microscopically sized chunks corresponding to DNA.

And even that doesn't always work. Consider a precision-engineered nanomachine, with each atom accounted for. Intuitively, "the nanomachine's state" should be an abstraction over those atoms. However, there's not necessarily any comparatively miniscule "chunk" of the nanomachine that actually redundantly encodes its state! E. g., a given exact position of appendage#12 may be consistent either with resource-extraction or with rapid travel.

So: Suppose we have some set of random variables X representing some cube of voxels where each voxel reports what atoms are in it. Imagine a dataset of various animals (or nanomachines) in this format, of various breeds and in various positions.

"This is a dog" tells us some information about X: H(X|dog)<H(X). Indeed, it tells us a fairly rich amount of information: the general "shape" of what we should expect to see there. However, for any individual Xi, H(Xi|dog)≈H(Xi).[1] Which is to say: "this is a dog" is synergistic information about X! Not redundant information. And symmetrically, sampling a given small chunk of X won't necessarily tell us whether it's the snapshot of a dog or a cat (unless we happen to sample a DNA fragment). H(animal|X)=0, but H(animal|Xi)≈H(animal).

One way around this is to suggest that cats/dogs/nanomachines aren't abstractions over their constituent parts, but abstractions over the resampling of all their constituent parts under state transitions. I. e., suppose we now have 3D video recordings: then "this is a dog" is redundantly encoded in each X(t) for t∈[tstart,tend].

But that seems counterintuitive/underambitious. Intuitively, tons of abstractions are about robust synergistic information/emergent dynamics.

Is there some obvious way around all that, or it's currently an open question?

  1. ^

    Though it's not literally zero. E. g., if we have a fixed-size voxel cube, then depending on whether it's a dog or an elephant, we should expect the voxels at the edges to be more or less likely to contain air vs. flesh.

Reply
Load More
AI Safety Public Materials
3 years ago
(+195)
10Synthesizing Standalone World-Models, Part 4: Metaphysical Justifications
1d
0
6Synthesizing Standalone World-Models, Part 3: Dataset-Assembly
2d
0
8Synthesizing Standalone World-Models, Part 2: Shifting Structures
3d
0
11Synthesizing Standalone World-Models, Part 1: Abstraction Hierarchies
4d
0
34Synthesizing Standalone World-Models (+ Bounties, Seeking Funding)
5d
2
19Abstract Mathematical Concepts vs. Abstractions Over Real-World Systems
7mo
2
17Are You More Real If You're Really Forgetful?
Q
10mo
Q
4
5Thane Ruthenis's Shortform
1y
34
30Idealized Agents Are Approximate Causal Mirrors (+ Radical Optimism on Agent Foundations)
2y
7
11How Would an Utopia-Maximizer Look Like?
2y
9
Load More