On the role of values: values clearly do play some role in determining which abstractions we use. An alien who observes Earth but does not care about anything on Earth's surface will likely not have a concept of trees, any more than an alien which has not observed Earth at all. Indifference has a similar effect to lack of data.
However, I expect that the space of abstractions is (approximately) discrete. A mind may use the tree-concept, or not use the tree-concept, but there is no natural abstraction arbitrarily-close-to-tree-but-not-the-same-as-tree. There is no continuum of tree-like abstractions.
So, under this model, values play a role in determining which abstractions we end up choosing, from the discrete set of available abstractions. But they do not play any role in determining the set of abstractions available. For AI/alignment purposes, this is all we need: as long as the set of natural abstractions is discrete and value-independent, and human concepts are drawn from that set, we can precisely define human concepts without a detailed model of human values.
Also, a mostly-unrelated note on the airplane example: when we're trying to "define" a concept by drawing a bounding box in some space (in this case, a literal bounding box in physical space), it is almost always the case that the bounding box will not actually correspond to the natural abstraction. This is basically the same idea as the cluster structure of thingspace and rubes vs bleggs. (Indeed, Bayesian clustering is directly interpretable as abstraction discovery: the cluster-statistics are the abstract summaries, and they induce conditional independence between the points in each cluster.) So I would interpret the airplane example (and most similar examples in the legal system) not as a change in a natural concept, but rather as humans being bad at formally defining their natural concepts, and needing to update their definitions as new situations crop up. The definitions are not the natural concepts; they're proxies.
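The parenthetical about cluster-statistics inducing conditional independence can be checked numerically on a toy case. This is only a sketch under assumptions that are not in the comment itself: one-dimensional Gaussian clusters, two points per cluster, and conditioning on the true cluster mean rather than on an estimated statistic. All variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_clusters = 20_000

# Each cluster has a hidden mean (the "abstract summary");
# two points are then drawn independently around that mean.
cluster_mean = rng.normal(0.0, 3.0, size=n_clusters)
point_a = cluster_mean + rng.normal(0.0, 1.0, size=n_clusters)
point_b = cluster_mean + rng.normal(0.0, 1.0, size=n_clusters)

# Marginally, two points from the same cluster are strongly correlated,
# because both carry information about the shared mean.
marginal_corr = np.corrcoef(point_a, point_b)[0, 1]

# Conditional on the cluster mean, what is left of each point is pure noise,
# so the residuals are (up to sampling error) uncorrelated.
conditional_corr = np.corrcoef(point_a - cluster_mean,
                               point_b - cluster_mean)[0, 1]

print(f"marginal correlation:    {marginal_corr:.3f}")
print(f"conditional correlation: {conditional_corr:.3f}")
```

In this Gaussian toy, uncorrelated residuals are also independent, which is the sense in which the summary screens the points off from each other.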
Also interested in helping on this - if there's modelling you'd want to outsource.
Here's one fairly-standalone project which I probably won't get to soon. It would be a fair bit of work, but also potentially very impressive in terms of both showing off technical skills and producing cool results.
Short somewhat-oversimplified version: take a finite-element model of some realistic objects. Backpropagate to compute the jacobian of final state variables with respect to initial state variables. Take a singular value decomposition of the jacobian. Hypothesis: the singular vectors will roughly map to human-recognizable high-level objects in the simulation (i.e. the nonzero elements of any given singular vector should be the positions and momenta of each of the finite elements comprising one object).
Longer version: conceptually, we imagine that there's some small independent Gaussian noise in each of the variables defining the initial conditions of the simulation (i.e. positions and momenta of each finite element). Assuming the dynamics are such that the uncertainty remains small throughout the simulation - i.e. the system is not chaotic - our uncertainty in the final state is then also Gaussian, with covariance found by multiplying the initial covariance by the jacobian on the left and by its transpose on the right. The hypothesis that information-at-a-distance (in this case "distance" = later time) is low-dimensional then basically says that the final covariance (and therefore the jacobian) is approximately low-rank.
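Written out, the linearized noise propagation looks roughly like this. The symbols (x_0 for the initial state, x_T for the final state, J for the jacobian, sigma for the noise scale) are notation introduced here for illustration, and the first line is only the small-noise, non-chaotic approximation described above.

```latex
\begin{aligned}
\delta x_T &\approx J\,\delta x_0,
  \qquad J = \frac{\partial x_T}{\partial x_0},
  \qquad \delta x_0 \sim \mathcal{N}\!\left(0,\ \sigma^2 I\right) \\
\Longrightarrow\quad \delta x_T &\sim \mathcal{N}\!\left(0,\ \sigma^2 J J^{\top}\right) \\
J = U S V^{\top} \quad\Longrightarrow\quad \sigma^2 J J^{\top} &= \sigma^2\, U S^2 U^{\top}
\end{aligned}
```

If only a few singular values in S are non-negligible, the final uncertainty lives almost entirely in the span of the corresponding columns of U, and it depends on the initial noise only through its components along the corresponding columns of V.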
In order for this to both work and be interesting, there are some constraints on both the system and on how the simulation is set up. First, "not chaotic" is a pretty big limitation. Second, we want the things-simulated to not just be pure rigid-body objects, since in that case it's pretty obvious that the method will work and it's not particularly interesting. Two potentially-interesting cases to try:
If you wanted to produce a really cool visual result, then I'd recommend setting up the simulation in Houdini, then attempting to make it play well with backpropagation. That would be a whole project in itself, but if viable the results would be very flashy.
Important implementation note: you'd probably want to avoid explicitly calculating the jacobian. Code it as a linear operator - i.e. a function which takes in a vector, and returns the product of the jacobian times that vector - and then use a sparse SVD method to find the largest singular values and corresponding singular vectors. (Unless you know how to work efficiently with jacobian matrices without doing that, but that's a pretty unusual thing to know.)
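To make the linear-operator idea concrete, here is a minimal NumPy-only sketch. Everything in it is an assumption for illustration: the "simulation" is two uncoupled, damped 1D spring chains stepped with explicit Euler (a toy stand-in for finite-element objects), the names chain_step_matrix, jvp, vjp and top_singular_triplets are hypothetical, and a plain subspace iteration is swapped in for a library sparse-SVD routine. In a real differentiable simulator the two products would instead come from forward-mode and reverse-mode autodiff.

```python
import numpy as np

def chain_step_matrix(n, stiffness, damping, dt):
    """Explicit-Euler step for a chain of n unit masses; state = [positions, momenta]."""
    lap = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    lap[0, 0] = lap[-1, -1] = 1.0  # free ends: rigid translation stretches no spring
    eye, zero = np.eye(n), np.zeros((n, n))
    generator = np.block([[zero, eye], [-stiffness * lap, -damping * eye]])
    return np.eye(2 * n) + dt * generator

def block_diag(a, b):
    """Stack two square matrices on the diagonal (two objects that never interact)."""
    out = np.zeros((a.shape[0] + b.shape[0], a.shape[1] + b.shape[1]))
    out[:a.shape[0], :a.shape[1]] = a
    out[a.shape[0]:, a.shape[1]:] = b
    return out

N_PER_OBJECT, DT, N_STEPS = 5, 0.01, 2000
STEP = block_diag(
    chain_step_matrix(N_PER_OBJECT, stiffness=5.0, damping=0.5, dt=DT),  # object 1
    chain_step_matrix(N_PER_OBJECT, stiffness=2.0, damping=1.0, dt=DT),  # object 2
)

def jvp(v):
    """Jacobian times v (a vector or a block of columns): push a perturbation forward in time."""
    for _ in range(N_STEPS):
        v = STEP @ v
    return v

def vjp(w):
    """Transposed jacobian times w: pull a final-state sensitivity back to the initial state."""
    for _ in range(N_STEPS):
        w = STEP.T @ w
    return w

def top_singular_triplets(jvp, vjp, dim, k, n_iter=12, oversample=4, seed=0):
    """k largest singular triplets via subspace iteration, using only J and J^T products."""
    rng = np.random.default_rng(seed)
    basis, _ = np.linalg.qr(rng.standard_normal((dim, k + oversample)))
    for _ in range(n_iter):
        basis, _ = np.linalg.qr(vjp(jvp(basis)))  # one power step on J^T J
    image = jvp(basis)                            # J restricted to the current subspace
    u, s, wt = np.linalg.svd(image, full_matrices=False)
    return u[:, :k], s[:k], (basis @ wt.T)[:, :k]

U, S, V = top_singular_triplets(jvp, vjp, dim=STEP.shape[0], k=2)

# For each leading right singular vector, report how much of its squared norm
# sits on each object's state variables (first 10 entries vs last 10).
half = 2 * N_PER_OBJECT
for j in range(2):
    on_obj1 = float(np.sum(V[:half, j] ** 2))
    print(f"singular value {S[j]:.4f}: weight on object 1 = {on_obj1:.4f}, "
          f"on object 2 = {1.0 - on_obj1:.4f}")
```

The jacobian is never formed here; the solver only ever calls jvp and vjp. Because the two toy objects do not interact, each leading singular vector should land on a single object, which is the trivially-easy version of the hypothesis; the interesting cases are the ones where that is not guaranteed by construction.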
Re: dual use, I do have some thoughts on exactly what sort of capabilities would potentially come out of this.
The really interesting possibility is that we end up able to precisely specify high-level human concepts - a real-life language of the birds. The specifications would correctly capture what-we-actually-mean, so they wouldn't be prone to goodhart. That would mean, for instance, being able to formally specify "strawberry on a plate" in a non-goodhartable way, so an AI optimizing for a strawberry on a plate would actually produce a strawberry on a plate. Of course, that does not mean that an AI optimizing for that specification would be safe - it would actually produce a strawberry on a plate, but it would still be perfectly happy to take over the world and knock over various vases in the process.
Of course just generally improving the performance of black-box ML is another possibility, but I don't think this sort of research is likely to induce a step-change in that department; it would just be another incremental improvement. However, if alignment is a bottleneck to extracting economic value from black-box ML systems, then this is the sort of research which would potentially relax that bottleneck without actually solving the full alignment problem. In other words, it would potentially make it easier to produce economically-useful ML systems in the short term, using techniques which lead to AGI disasters in the long term.
Re: picking up new tools, skills and practice designing and building user interfaces, especially to complex or not-very-transparent systems, would be very-high-leverage if the tool-adoption step is rate-limiting.
Relevant topic of a future post: some of the ideas from Risks From Learned Optimization or the Improved Good Regulator Theorem offer insights into building effective institutions and developing flexible problem-solving capacity.
Rough intuitive idea: intelligence/agency are about generalizable problem-solving capability. How do you incentivize generalizable problem-solving capability? Ask the system to solve a wide variety of problems, or a problem general enough to encompass a wide variety.
If you want an organization to act agenty, then a useful technique is to constantly force the organization to solve new, qualitatively different problems. An organization in a highly volatile market subject to lots of shocks or distribution shifts will likely develop some degree of agency naturally.
Organizations with an adversary (e.g. traders in the financial markets) will likely develop some degree of agency naturally, as their adversary frequently adopts new methods to counter the organization's current strategy. Red teams are a good way to simulate this without a natural adversary.
Some organizations need to solve a sufficiently-broad range of problems as part of their original core business that they develop some degree of agency in the process. These organizations then find it relatively easy to expand into new lines of business. Amazon is a good example.
Conversely, businesses in stable industries facing little variability will end up with little agency. They can't solve new problems efficiently, and will likely be wiped out if there's a large shock or distribution shift in the market. They won't be good at expanding or pivoting into new lines of business. They'll tend to be adaptation-executors rather than profit-maximizers, to a much greater extent than agenty businesses.
This all also applies at a personal level: if you want to develop general problem-solving capability, then tackle a wide variety of problems. Try problems in many different fields. Try problems with an adversary. Try different kinds of problems, or problems with different levels of difficulty. Don't just try to guess which skills or tools generalize well; go out and find out which skills or tools generalize well.
If we don't know what to expect from future alignment problems, then developing problem-solving skills and organizations which generalize well is a natural strategy.
This post seems to me to be beating around the bush. There are several different classes of transparency methods evaluated by several different proxy criteria, but this is all sort of tangential to the thing which actually matters: we do not understand what "understanding a model" means, at least not in a sufficiently-robust-and-legible way to confidently put optimization pressure on it.
For transparency via inspection, the problem is that we don't know what kind of "understanding" is required to rule out bad behavior. We can notice that some low-level features implement a certain kind of filter, and that is definitely a valuable clue to how these things work, but we probably wouldn't be able to tell if those filters were wired together in a way which implemented a search algorithm.
For transparency via training, the problem is that we don't know what metric measures the "understandability" of the system in a way sufficient to rule out bad behavior. The tree regularization example is a good one - it is not the right metric, and we don't know what to replace it with. Use a proxy, get goodharted.
For transparency via architecture, the problem is that we don't know what architectural features allow "understanding" of the kind needed to rule out bad behavior. Does clusterability actually matter? No idea.
Under the hood, it's all the same problem.
One way in which the analogy breaks down: in the lever case, we have two levers right next to each other, and each does something we want - it's just easy to confuse the levers. A better analogy for AI might be: many levers and switches and dials have to be set to get the behavior we want, and mistakes in some of them matter while others don't, and we don't know which ones matter when. And sometimes people will figure out that a particular combination extends the flaps, so they'll say "do this to extend the flaps", except that when some other switch has the wrong setting and it's between 4 and 5 pm on Thursday that combination will still extend the flaps, but it will also retract the landing gear, and nobody noticed that before they wrote down the instructions for how to extend the flaps.
Some features which this analogy better highlights:
Yup, this is basically where that probability came from. It still feels about right.
This is a great explanation. I basically agree, and this is exactly why I expect alignment-by-default to most likely fail even conditional on the natural abstractions hypothesis holding up.
This is the best explanation I have seen yet of what seem to me to be the main problems with HCH. In particular, that scene from HPMOR is one that I've also thought about as a good analogue for HCH problems. (Though I do think the "humans are bad at things" issue is probably more important for HCH than the malicious memes problem; HCH is basically a giant bureaucracy, and the same shortcomings which make humans bad at giant bureaucracies will directly limit HCH.)