johnswentworth

Sequences

"Why Not Just..."
Basic Foundations for Agent Models

Comments

Rant on Problem Factorization for Alignment

It may well be the case that even if you removed [incentive to distort the credit assignment], credit assignment would still be a major problem for things like HCH, but how can you know this from empirical experience with real-world human institutions (which you emphasize in the OP)?

Because there exist human institutions in which people generally seem basically aligned and not trying to game the credit assignment. For instance, most of the startups I've worked at were like this (size ~20 people), and I think the alignment research community is basically like this today (although I'll be surprised if that lasts another 3 years). Probably lots of small-to-medium size orgs are like this, especially in the nonprofit space. It's hard to get very big orgs/communities without letting in some credit monsters, but medium-size is still large enough to see coordination problems kick in (we had no shortage of them at ~20-person startups).

And, to be clear, I'm not saying these orgs have zero incentive to distort credit assignment. Humans do tend to do that sort of thing reflexively, to some extent. But to the extent that it's reflexive, it would also apply to HCH and variants thereof. For instance, people in HCH would still reflexively tend to conceal evidence/arguments contradicting their answers.  (And when someone does conceal contradictory evidence/arguments, that would presumably increase the memetic fitness of their claims, causing them to propagate further up the tree, so that also provides a selection channel.) Similarly, if the HCH implementation has access to empirical testing channels and the ability to exchange multiple messages, people would still reflexively tend to avoid/bury tests which they expect will actually falsify their answers, or try to blame incorrect answers on subquestions elsewhere in the tree when an unexpected experimental outcome occurs and someone tries to backpropagate to figure out where the prediction-failure came from. (And, again, those who shift blame successfully will presumably have more memetic fitness, etc.)

Rant on Problem Factorization for Alignment

Yeah, at some point we're basically simulating the alignment community (or possibly several copies thereof interacting with each other). There will probably be another post on that topic soonish.

Rant on Problem Factorization for Alignment

It seems to me that lack of frequent empirical grounding is what makes HCH particularly vulnerable to memetic selection.

Would you still expect this to go badly wrong (assume you get to pick the humans)? If [yes, no], what do you see as the important differences?

Ok, so, some background on my mental image. Before yesterday, I had never pictured HCH as a tree of John Wentworths (thank you Rohin for that). When I do picture John Wentworths, they mostly just... refuse to do the HCH thing. Like, they take one look at this setup and decide to (politely) mutiny or something. Maybe they're willing to test it out, but they don't expect it to work, and it's likely that their output is something like the string "lol nope". I think an entire society of John Wentworths would probably just not have bureaucracies at all; nobody would intentionally create them, and if they formed accidentally nobody would work for them or deal with them.

Now, there's a whole space of things-like-HCH, and some of them look less like a simulated infinite bureaucracy and more like a simulated society. (The OP mostly wasn't talking about things on the simulated-society end of the spectrum, because there will be another post on that.) And I think a bunch of John Wentworths in something like a simulated society would be fine - they'd form lots of small teams working in-person, have forums like LW for reasonably-high-bandwidth interteam communication, and have bounties on problems and secondary markets on people trying to get the bounties and independent contractors and all that jazz.

Anyway, back to your question. If those John Wentworths lacked the ability to run experiments, they would be relatively pessimistic about their own chances, and a huge portion of their work would be devoted to figuring out how to pump bits of information and stay grounded without a real-world experimental feedback channel. That's not a deal-breaker; background knowledge of our world already provides far more bits of evidence than any experiment ever run, and we could still run experiments on the simulated-Johns. But I sure would be a lot more optimistic with an experimental channel.

I do not think memetic selection in particular would cripple those Johns, because that's exactly the sort of thing they'd be on the lookout for. But I'm not confident of that. And I'd be a lot more pessimistic about the vast majority of other people. (I do expect that most people think a bureaucracy/society of themselves would work better than the bureaucracies/societies we have, and I expect that at least a majority and probably a large majority are wrong about that, because bureaucracies are generally made of median-ish people. So I am very suspicious of my inner simulator saying "well, if it was a bunch of copies of John Wentworth, they would know to avoid the failure modes which mess up real-world bureaucracies/societies". Most people probably think that, and most people are probably wrong about it.)

I do think our current civilization is crippled by memetic selection to a pretty significant extent. (I mean, that's not the only way to frame it or the only piece, but it's a correct frame for a large piece.)

I don't think it's a gap in economic theory in general: pretty sure I've heard the [price mechanisms as distributed computation] idea from various Austrian-school economists without reliance on agents with different goals - only on "What should x cost in context y?" being a question whose answer depends on the entire system.

Economists do talk about that sort of thing, but I don't usually see it in their math. Of course we can get e.g. implied prices for any pareto-optimal system, but I don't know of math saying that systems will end up using those implied prices internally.

Rant on Problem Factorization for Alignment

Re disanalogy 1: I'm not entirely sure I understand what your objection is here but I'll try responding anyway.

I was mostly thinking of the unconscious economics stuff.

Personally my inner sim feels pretty great about the combination of disanalogy 1 and disanalogy 2 -- it feels like a coalition of Rohins would do so much better than an individual Rohin, as long as the Rohins had time to get familiar with a protocol and evolve it to suit their needs. (Picturing some giant number of Rohins a la disanalogy 3 is a lot harder to do but when I try it mostly feels like it probably goes fine.)

I should have asked for a mental picture sooner; this is very useful to know. Thanks.

If I imagine a bunch of Johns, I think that they basically do fine, though mainly because they just don't end up using very many Johns. I do think a small team of Johns would do way better than I do.

Rant on Problem Factorization for Alignment

My intuition around (1) being important mostly comes from studying things like industrial organization and theory of the firm.

Oh that's really interesting. I did a dive into theory of the firm research a couple years ago (mainly interested in applying it to alignment and subagent models) and came out with totally different takeaways. My takeaway was that the difficulty of credit assignment is a major limiting factor (and in particular this led to thinking about Incentive Design with Imperfect Credit Assignment, which in turn led to my current formulation of the Pointers Problem).

Now, the way economists usually model credit assignment is in terms of incentives, which theoretically aren't necessary if all the agents share a goal. On the other hand, looking at how groups work in practice, I expect that the informational role of credit assignment is actually the load-bearing part at least as much as (if not more than) the incentive-alignment role.

For instance, a price mechanism doesn't just align incentives, it provides information for efficient production decisions, such that it still makes sense to use a price mechanism even if everyone shares a single goal. If the agents share a common goal, then in theory there doesn't need to be a price mechanism, but a price mechanism sure is an efficient way to internally allocate resources in practice.
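To make the shared-goal case concrete, here is a minimal textbook-style sketch, not something taken from the theory-of-the-firm literature mentioned above. The symbols are all illustrative assumptions: n units, each turning x_i of a single resource into value f_i(x_i), with every f_i concave and differentiable, a total budget R, an interior optimum, and one common objective (total value).

```latex
\begin{aligned}
&\max_{x_1,\dots,x_n} \ \sum_{i=1}^{n} f_i(x_i) \quad \text{s.t.} \quad \sum_{i=1}^{n} x_i = R \\
&\mathcal{L} = \sum_{i=1}^{n} f_i(x_i) - \lambda \Big( \sum_{i=1}^{n} x_i - R \Big) \\
&\frac{\partial \mathcal{L}}{\partial x_i} = 0 \ \Longrightarrow\ f_i'(x_i^*) = \lambda \quad \text{for every } i \\
&x_i^* = \arg\max_{x_i} \ \big[ f_i(x_i) - \lambda x_i \big]
\end{aligned}
```

The multiplier λ plays the role of an internal price: once a unit knows λ, it can pick its own x_i using only its local f_i, even though nobody's goals differ. Under these assumptions the price is doing purely informational work.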

... and now that I'm thinking about it, there's a notable gap in economic theory here: the economists are using agents with different goals to motivate price mechanisms (and credit allocation more generally), even though the phenomenon does not seem like it should require different goals.

I'm still not getting a good picture of what your thinking is on this. Seems like the inferential gap is wider than you're expecting? Can you go into more details, and maybe include an example?

Memetics example: in the vanilla HCH tree, some agent way down the tree ignores their original task and returns an answer which says "the top-level question asker urgently needs to know X!" followed by some argument. And that sort of argument, if it has high memetic fitness (independent of whether it's correct), gets passed all the way back up the tree. The higher the memetic fitness, the further it propagates.

And if we have an exponentially large tree, with this sort of thing being generated a nontrivial fraction of the time, then there will be lots of these things generated. And there will be a selection process as more-memetically-fit messages get passed up, collide with each other, and people have to choose which ones to pass further up. What pops out at the top is, potentially, very-highly-optimized memes drawn from an exponentially large search space.
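A toy simulation of that selection process might look like the sketch below. Everything in it is an assumption made for illustration: the binary tree, the uniform fitness distribution, correctness being a coin flip independent of fitness, and each agent simply passing up the most compelling message it sees. The names `Meme`, `surviving_meme`, and `tree_size` are hypothetical.

```python
import random
from typing import NamedTuple


class Meme(NamedTuple):
    fitness: float  # how compelling the message is to whoever reads it
    correct: bool   # drawn independently of fitness in this toy model


def tree_size(depth: int) -> int:
    """Number of agents in a full binary tree with the given depth."""
    return 2 ** (depth + 1) - 1


def surviving_meme(depth: int, rng: random.Random) -> Meme:
    """Return the meme that reaches the top of a binary tree of agents.

    Each agent generates one meme of its own, receives one meme from each
    of its two children (if any), and passes up whichever is most fit.
    """
    own = Meme(fitness=rng.random(), correct=rng.random() < 0.5)
    if depth == 0:
        return own
    candidates = [
        own,
        surviving_meme(depth - 1, rng),
        surviving_meme(depth - 1, rng),
    ]
    # Selection acts on fitness only; correctness never enters the choice.
    return max(candidates, key=lambda m: m.fitness)


if __name__ == "__main__":
    rng = random.Random(0)
    for depth in (0, 4, 8, 12):
        meme = surviving_meme(depth, rng)
        print(depth, tree_size(depth), round(meme.fitness, 4), meme.correct)
```

In this sketch the fitness of whatever pops out at the top climbs toward the maximum as the tree grows, while its correctness stays a coin flip, since nothing in the process selects for it.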

And of course this all applies even if the individual agents are all well-intentioned and trying their best. As with "unconscious economics", it's the selection pressures which dominate, not the individuals' intentions.

Rant on Problem Factorization for Alignment

Typical humans in typical bureaucracies do not seem at all aligned with the goals that the bureaucracy is meant to pursue.

Why would this be any different for simulated humans or for human-mimicry based AI (which is what ~all of the problem-factorization-based alignment strategies I've seen are based on)?

Since you reuse one AI model for each element of the bureaucracy, doing prework to establish sophisticated coordinated protocols for the bureaucracy takes a constant amount of effort, whereas in human bureaucracies it would scale linearly with the number of people. As a result with the same budget you can establish a much more sophisticated protocol with AI than with humans.

This one I buy. Though if it's going to be the key load-bearing piece which makes e.g. something HCH-like work better than the corresponding existing institutions, then it really ought to play a more central role in proposals, and testing it on humans now should be a high priority. (Some of Ought's work roughly fits that, so kudos to them, but I don't know of anyone else doing that sort of thing.)

After a mere 100 iterations of iterated distillation and amplification where each agent can ask 2 subquestions, you are approximating a bureaucracy of 2^100 agents, which is wildly larger than any human bureaucracy and has qualitatively different strategies available to it. Probably it will be a relatively bad approximation but the exponential scaling with linear iterations still seems pretty majorly different from human bureaucracies.

Empirically it does not seem like bureaucracies' problems get better as they get bigger. It seems like they get worse. And like, sure, maybe there's a phase change if you go to really exponentially bigger sizes, but "maybe there's a phase change and it scales totally differently than we're used to and this happens to be a good thing rather than a bad thing" is the sort of argument you could make about anything; we really need some other reason to think that hypothesis is worth distinguishing at all.
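For a sense of scale, the 2^100 figure in the quoted claim can be checked with a couple of lines of arithmetic. This is only a sketch: the helper name `approx_agents` is hypothetical, and the world-population figure is a rough round number assumed for comparison.

```python
def approx_agents(iterations: int, subquestions: int) -> int:
    """Agents approximated after `iterations` rounds with `subquestions` each."""
    return subquestions ** iterations


n = approx_agents(iterations=100, subquestions=2)
print(n)  # 1267650600228229401496703205376, i.e. roughly 1.27e30

# Rough comparison point (assumed round figure, not from the discussion).
world_population = 8 * 10**9
print(n // world_population)  # simulated agents per living human
```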

I think these disanalogies are driving most of the disagreement, rather than things like "not knowing about real-world evidence" or even "failing to anticipate results in simple cases we can test today". For example, for the relay experiment you mention, at least I personally (and probably others) did in fact anticipate these results in advance.

Kudos for correct prediction!

Continuing in the spirit of expressing my highly uncharitable intuitions, my intuitive reaction to this is "hmm Rohin's inner simulator seems to be working fine, maybe he's just not actually applying it to picture what would happen in an actual bureaucracy when making changes corresponding to the proposed disanalogies". On reflection I think there's a strong chance you have tried picturing that, but I'm not confident, so I mention it just in case you haven't. (In particular disanalogy 3 seems like one which is unlikely to work in our favor when actually picturing it, and my inner sim is also moderately skeptical about disanalogy 2.)

Rant on Problem Factorization for Alignment

The main reason it would transfer to HCH (and ~all other problem-factorization-based proposals I've seen) is because the individual units in those proposals are generally human-mimickers of some kind (similar to e.g. GPT). Indeed, the original point of HCH is to be able to solve problems beyond what an individual human can solve while training on human mimicry, in order to get the outer alignment benefits of human mimicry.

E.g. for unconscious economics in particular, the selection effects mostly apply to memetics in the HCH tree. And in versions of HCH which allow repeated calls to the same human (as Paul's later version of the proposal does IIRC), unconscious economics applies in the more traditional ways as well.

The two differences you mention seem not-particularly-central to real-world institutional problems. In order to expect that existing problems wouldn't transfer, based on those two differences, we'd need some argument that those two differences address the primary bottlenecks to better performance in existing institutions. (1) seems mostly-irrelevant-in-practice to me; do you want to give an example or two of where it would be relevant? (2) has obvious relevance, but in practice I think most institutions do not have so many coordinators that it's eating up a plurality of the budget, which is what I'd expect to see if there weren't rapidly decreasing marginal returns on additional coordinators. (Though I could give a counterargument to that: there's a story in which managers, who both handle most coordination in practice and make hiring decisions, tend to make themselves a bottleneck by under-hiring coordinators, since coordinators would compete with the managers for influence.) Also it is true that particularly good coordinators are extremely expensive, so I do still put some weight on (2).

Rant on Problem Factorization for Alignment

I endorse this criticism, though I think the upsides outweigh the downsides in this case. (Specifically, the relevant upsides are (1) being able to directly discuss generators of beliefs, and (2) just directly writing up my intuitions is far less time-intensive than a view-on-reflection, to the point where I actually do it rather than never getting around to it.)
