There is a disheartening irony in calling this series "Practical AI Safety" when its longest post is about capabilities advancements which largely ignore safety.
The first part of this post consists in observing that ML applications proceed from metrics, and subsequently arguing that theoretical approaches have been unsuccessful in learning problems. This is true but irrelevant for safety, unless your proposal is to apply ML to safety problems, which reduces AI Safety to 'just find good metrics for safe behaviour'. This seems as far from a pragmatic understanding of what is needed in AI Safety as one can get.
In the process of dismissing theoretical approaches, you ask "Why do residual connections work? Why does fractal data augmentation help?" These are exactly the kind of questions which we need to be building theory for, not to improve performance, but for humans to understand what is happening well enough to identify potential risks orthogonal to the benchmarks which such techniques are improving against, or trust that such risks are not present.
You say, "If we want to have any hope of influencing the ML community broadly, we need to understand how it works (and sometimes doesn’t work) at a high level," and provide similar prefaces as motivation in other sections. I find these claims credible, assuming the "we" refers to AI Safety researchers, but considering the alleged pragmatism of this sequence, it's surprising to me that none of the claims are followed up with suggested action points. Given the information you have provided, how can we influence this community? By publishing ML papers at NeurIPS? And to what end are you hoping to influence them? AI Safety can attract attention, but attention alone doesn't translate into progress (or even into more person-hours).
Your disdain for theoretical approaches is transparent here (if it wasn't already from the name of this sequence). But your reasoning cuts both ways. You say, "Even if the current paradigm is flawed and a new paradigm is needed, this does not mean that [a researcher's] favorite paradigm will become that new paradigm. They cannot ignore or bargain with the paradigm that will actually work; they must align with it." I expect that 'metrics suffice' (a strawperson of your favoured paradigm) will not be the paradigm that will actually work, and it's disappointing that your sequence carries the message (to my reading) that technical ML researchers can make significant progress in alignment and safety without really changing what they're doing.
If I haven't found a way to extend my post-doc position (ending in August) by mid-July and by some miracle this job offer is still open, it could be the perfect job for me. Otherwise, I look forward to seeing the results.
The example you give has a pretty simple lattice of preferences, which lends itself to illustrations but which might create some misconceptions about how the subagent model should be formalized. For instance, you assume that the agents' preferences are orthogonal (one cares about pepperoni, the other about mushrooms, and each is indifferent to the opposite direction), that the agents have equal weighting in the decision-making, and that the lattice is distributive. Once these assumptions are dropped, there are many ways that a given 'weak utility' can be expressed in terms of subagents. I'm sure there are optimization questions that follow here, about the minimum number of subagents (dimensions) needed to embed a given weak-utility function (partially ordered set), and about when reasonable constraints such as orthogonality of subagents can be imposed. There are also composition questions: how does a committee of agents with subagents behave?
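To make the non-uniqueness point concrete, here is a minimal sketch. It assumes a Pareto-style committee rule (an option is strictly preferred only if no subagent is worse off and at least one is better off); that rule, the option names, and the utility values are all my own illustrative choices rather than anything fixed by the post. It shows two different pairs of subagents, one orthogonal and one not, inducing the same partial order on four pizzas.

```python
from itertools import permutations

OPTIONS = ["none", "pepperoni", "mushroom", "both"]

def strictly_prefers(subagents, x, y):
    """Assumed Pareto rule: x beats y if every subagent weakly prefers x
    and at least one subagent strictly prefers x."""
    return (all(u[x] >= u[y] for u in subagents)
            and any(u[x] > u[y] for u in subagents))

def induced_order(subagents):
    """Return the strict partial order as a set of (better, worse) pairs."""
    return {(x, y) for x, y in permutations(OPTIONS, 2)
            if strictly_prefers(subagents, x, y)}

# Orthogonal subagents: each cares about exactly one topping.
orthogonal = [
    {"none": 0, "pepperoni": 1, "mushroom": 0, "both": 1},
    {"none": 0, "pepperoni": 0, "mushroom": 1, "both": 1},
]

# Non-orthogonal subagents: each ranks all four options linearly,
# but they disagree about pepperoni versus mushroom.
skewed = [
    {"none": 0, "pepperoni": 1, "mushroom": 2, "both": 3},
    {"none": 0, "pepperoni": 2, "mushroom": 1, "both": 3},
]

same_order = induced_order(orthogonal) == induced_order(skewed)
```

Under these assumptions `same_order` comes out true, and in both cases pepperoni-only and mushroom-only are incomparable, so the induced order alone does not pin down the subagents.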
It's really nice to see a critical take on analytic philosophy, thank you for this post. The call-out aspect was also appreciated: coming from mathematics, where people are often quite reckless about naming conventions to the detriment of pedagogical dimensions of the field, it is quite refreshing.
On the philosophy content, it seems to me that many of the vices of analytic philosophy are hard to shake, even for a critic such as yourself.
Consider the "Back to the text" section. There is some irony in your accusation of Chalmers basing his strategy on its name via its definition rather than the converse, yet you end that section by giving a definition-by-example of what engineering is and then proceed with that definition. To me, this points to the tension between dismissing the idea that concepts should be well-defined notions in philosophical discourse, while relying on at least some precision of denotation in using names of concepts in discourse.
You also seem to lean on anthropological principles as analytic philosophy does. I agree that the only concepts which will appear in philosophical discourse will be those which are relevant to human experience, but that experience extends far beyond "human life" to anything of human interest (consider the language of physics and mathematics, which often doesn't have direct relation to our immediate experience), and this is a consequence of the fact that philosophy is a human endeavour rather than anything intrinsic to its content.
I'd like to take a different perspective on your Schmidhuber quote. Contrary to your interpretation, the fact that concepts are physically encoded in neural structures supports the Platonic idea that these concepts have an independent existence (albeit a far more mundane one than Plato might have liked). The empirical philosophy approach might be construed as investigating the nature of concepts statistically. However, there is a category error happening here: in pursuing this investigation, an empirical philosopher is conflating the value of the global concept with their own "partial" perspective concept.
I would argue that, whether one is convinced they exist or not, no one is invested in communal concepts, which are the kind of fragmented, polysemous entity which you describe, for their own sake. Individuals are mostly invested in their own conceptions of concepts, and take an interest in communal concepts only insofar as they are interested in being included in the community in which those concepts reside. In short, relativism is an alternative way to resolve concepts: we can proceed not by rejecting the idea that concepts can have clear definitions (which serve to ground discourse in place of the more nebulous intuitions which motivate them), but rather by recognizing that any such definitions must come with a limited scope. I also personally reject the idea that a definition should be expected to conform to all of the various "intuitions" which are appealed to in classical philosophy for various reasons, but especially because there seems no a priori reason that any human should have infallible (or even rational) intuitions about concepts.
I might even go so far as to say that recognizing relativism incorporates your divide and conquer approach to resolving disagreement: the gardeners and landscape artists can avoid confusion when they discuss the concept of soil by recognizing their differing associations with the concept and hence specifying the details relevant to the union of their interests. But each can discard the extraneous details in discussion with their own community, just as physicists will go back to talking about "sound" in its narrowed sense when talking with other physicists. These narrowings only seem problematic if one expects the scope of all discourse to be universal.
Re "I'm not fully sold on category theory as a mathematical tool", if someone (e.g. me) were to take the category you've outlined and run with it, in the sense of establishing its general structure and special features, could you be convinced? Are there questions that you have about this category that you currently are only able to answer by brute force computation from the definitions of the objects and morphisms as you've given them? More generally, are there variants of this category that you've considered that it might be useful to study in parallel?
I am very experienced in category theory but not with the Chu construction (or *-autonomous categories in general). There is a widely used notion of subobject of an object A in a category C as "equivalence class of monomorphisms with codomain A". This differs from your definition most conspicuously in the case of ⊤, where there is no morphism from this frame to a typical frame.
If I'm calculating correctly, the standard notion of subobject is strictly stronger than the one you present here (as long as the world W is inhabited, and even when it is empty I think the construction collapses enough to make it true). This is because monomorphisms are morphisms which are injective in their agent argument and surjective in their environment argument, and we can extend any morphism to ⊥ along such a monomorphism.
Now, I notice that you refer to the concepts in this post as subagents rather than subframes, so perhaps you were deliberately avoiding this stronger concept. Intuitively, a subframe in the sense I describe above consists of an agent with a subset of the available options and who may not be able to distinguish between some of the environments present in a larger frame; the "precommitted agent" you mention early on here seems to be a special case of this which is the identity in the environment component. Incidentally, the equivalence relation corresponding to this notion of subobject corresponds to isomorphism in the finite case but is non-trivial for a similar reason to the case you described of infinite frames.
I wonder if you have any thoughts about how these notions compare? It's clear from the discussion that you chose a definition which reflected what you wanted to express, which is always good, but on the other hand the monomorphisms I described will crop up when you consider factorizations of the morphisms in your category more generally. Perhaps they could be useful to you.
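For concreteness, here is a rough sketch of the subframe condition I have in mind, for finite frames only. The representation (a dict with agents, environments and an evaluation table), the function names, and the toy frames are my own hypothetical choices, and I am assuming the usual Chu-style morphism convention: a map forward on agents, a map backward on environments, and agreement of the two evaluations. The check for "injective on agents, surjective on environments" is the characterization of monomorphisms claimed above, not something I have verified for your category in full generality.

```python
from itertools import product

def make_frame(agents, envs, table):
    """A finite frame: agents, environments, and a table (agent, env) -> world."""
    return {"agents": list(agents), "envs": list(envs), "eval": dict(table)}

def is_morphism(src, dst, g, h):
    """g maps src agents to dst agents, h maps dst envs back to src envs.
    Assumed Chu-style condition: evaluating g(a) against f in dst
    agrees with evaluating a against h(f) in src."""
    return all(dst["eval"][(g[a], f)] == src["eval"][(a, h[f])]
               for a, f in product(src["agents"], dst["envs"]))

def is_candidate_subframe(src, dst, g, h):
    """Morphism that is injective on agents and surjective on environments."""
    injective = len({g[a] for a in src["agents"]}) == len(src["agents"])
    surjective = {h[f] for f in dst["envs"]} == set(src["envs"])
    return is_morphism(src, dst, g, h) and injective and surjective

# Toy frame: option a0 yields the same world whatever the environment does.
full = make_frame(
    ["a0", "a1"], ["e0", "e1"],
    {("a0", "e0"): "w0", ("a0", "e1"): "w0",
     ("a1", "e0"): "w1", ("a1", "e1"): "w2"},
)

# "Precommitted agent": fewer options, identity on environments.
precommitted = make_frame(
    ["a0"], ["e0", "e1"], {("a0", "e0"): "w0", ("a0", "e1"): "w0"}
)
pre_ok = is_candidate_subframe(
    precommitted, full, {"a0": "a0"}, {"e0": "e0", "e1": "e1"}
)

# Coarser subframe: the agent no longer distinguishes e0 from e1.
coarse = make_frame(["a0"], ["e"], {("a0", "e"): "w0"})
coarse_ok = is_candidate_subframe(
    coarse, full, {"a0": "a0"}, {"e0": "e", "e1": "e"}
)
```

Both `pre_ok` and `coarse_ok` hold for these toy frames: the first is the precommitted-agent special case, while the second also merges environments, which is the extra freedom the monomorphism notion allows.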