Director of Research at PAISRI
Assumption 3. A listener is said to have minimally consistent beliefs if each proposition X has a negation X*, and P(X)+P(X*)≤1.
One thing that's interesting to me is that this assumption is frequently not satisfied in real life due to underspecification, e.g. P(I'm happy) + P(I'm not happy) ≥ 1 because "happy" may be underspecified. I can't think of a really strong minimal example, but I feel like this pops up a lot in discussions of complex issues, where a dialectic develops because neither thesis nor antithesis captures everything, and so both are underspecified in ways that make their naive union exceed the available probability mass.
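To make the point concrete, here's a minimal sketch in Python. The numbers are entirely made up for illustration: the idea is just that when "happy" admits two overlapping readings, credence in the proposition and its nominal negation can sum past 1, violating Assumption 3.

```python
# Toy check of Assumption 3 (minimally consistent beliefs):
# P(X) + P(X*) <= 1. All credences below are hypothetical.
def minimally_consistent(p_x: float, p_not_x: float) -> bool:
    """Return True iff credences in X and its negation X* sum to at most 1."""
    return p_x + p_not_x <= 1.0

# A well-specified proposition satisfies the assumption.
print(minimally_consistent(0.6, 0.4))  # True

# An underspecified one can fail it: "I'm happy" read as "content
# overall" (0.7) and "I'm not happy" read as "dissatisfied with
# something" (0.5) together claim 1.2 of the probability mass.
print(minimally_consistent(0.7, 0.5))  # False
```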
Last summer, when I was at the EA Hotel for TAISU, I got the most value out of doing something similar. I'd host a session to "workshop" an idea I had: roughly 20 minutes of setting it up and 40 minutes of back and forth with people pointing things out, stating objections, asking for clarifications, etc. That was less structured than your approach, and I quite like your idea because it creates a level of safety mine did not: it effectively bans (at least for the course of the conjecture workshop) the kinds of criticism people sometimes jump to that shut down fruitful idea development.
Thanks for sharing!
I agree. I'm generally okay with the order (oracles do seem marginally safer than agents, for example, and more restrictions should generally be safer than fewer), but I also think the marginal amount of additional safety doesn't matter much when you consider the total absolute risk. Just to make up some numbers, I think of it like choosing between options that are 99.6%, 99.7%, 99.8%, and 99.9% likely to result in disaster. Of course I'll pick the one with a 0.4% chance of success, but I'd much rather do something radically different that is orders of magnitude safer.
In my mind it's something like you need:
I think people tend to emphasize the technical skills the most, and I'm sure other answers will offer more specific suggestions there, but I also think there's an important aspect of having the right mindset for this kind of work, such that a person with the right technical skills might not make much progress on AI safety without these other "soft" skills.
Am I the right "kind" of researcher for working in AI Safety? Here, my main intuition is that the field needs more "theory-builders" than "problem-solvers", to take the archetypes of Gowers's Two Cultures of Mathematics. By that I mean that AI Safety has not yet crystallized into a field where the main approaches and questions are well understood and known. Almost every researcher has a different perspective on what is fundamental in the field. Therefore, the most useful work will be that which clarifies, deconfuses, and characterizes the fundamental questions and problems in the field.
To add on to this, it also means it's going to be somewhat hard to know if you're the right kind of researcher or not, because the feedback cycle is long: you may be doing good work, but it's work that will take months or years to come together in a way that can be easily evaluated by others.
This doesn't mean everything looks maximally like this. It's less of an issue with, say, safety research focused on machine learning than with safety research focused on theoretical AI systems we don't know how to build yet, or on turning ideas about what safety looks like into something mathematically precise enough to build.
Thus a corollary of this answer might be something like "you might be the right kind of researcher only if you're okay with long (multi-year) feedback cycles".
Then that suggests to me an interesting hypothesis: maybe it can’t! What if some of our weirder instincts related to memory or counterfactual imagination are not adaptive at all, but rather crosstalk from social instincts, or vice-versa? For example, I think there’s a reaction in the subcortex that listens for a strong prediction of lower reward, alternating with a weak prediction of higher reward; when it sees this combination, it issues negative reward and negative valence. Think about what this subcortical reaction would do in the three different cases: If the weak prediction it sees is an empathetic simulation, well, that’s the core of jealousy! If the weak prediction it sees is a memory, well, that’s the core of loss aversion! If the weak prediction it sees is a counterfactual imagination, well, that’s the core of, I guess, that annoying feeling of having missed out on something good. Seems to fit together pretty well, right? I’m not super confident, but at least it’s food for thought.
I think this is interesting in terms of thinking about counterfactuals in decision theory, preference theory, etc. To me it suggests that when we talk about counterfactuals, we're putting our counterfactual worlds in a stance that mixes up what they are with what we want them to be. What they are, as in the thing going on in our brains that causes us to think in terms of counterfactual worlds, is predictions about the world (world models, or ontology); when we apply counterfactual reasoning, we're considering different predictions about the world contingent on different inputs, possibly including inputs other than the ones we actually saw but that we are able to simulate. This means it's not reasonable to expect counterfactual worlds to be consistent with the history of the world (the standard problem with counterfactuals), because they aren't alternative territories but maps of how we think we would have mapped different territory.
This doesn't exactly save counterfactual reasoning, but it does allow us to make better sense of what it is when we use it and why it works sometimes and why it's a problem other times.
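A toy sketch of what I mean, with all names and numbers invented for illustration: a "counterfactual world" is the same learned map re-run on inputs we simulate rather than observe, so there's no reason its output must cohere with the actual history.

```python
# Hypothetical stand-in for a learned predictive map: here it just
# predicts the next value as the mean of the inputs it is given.
def world_model(observations: list[float]) -> float:
    return sum(observations) / len(observations)

actual_inputs = [1.0, 2.0, 3.0]      # inputs we actually saw
simulated_inputs = [1.0, 2.0, 10.0]  # inputs we merely imagine seeing

factual_prediction = world_model(actual_inputs)
counterfactual_prediction = world_model(simulated_inputs)

# The counterfactual prediction is not an alternative territory; it is
# a map of how we think we would have mapped different territory, and
# it need not be consistent with what actually happened.
print(factual_prediction)         # 2.0
print(counterfactual_prediction)  # 13/3, about 4.33
```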
Hang on, you say: That doesn’t seem right! If it were the exact same generative models, then when we remember dancing, we would actually issue the motor commands to start dancing! Well, I answer, we do actually sometimes move a little bit when we remember a motion! I think the rule is, loosely speaking, the top-down information flow is much stronger (more confident) when predicting, and much weaker for imagination, memory, and empathy. Thus, the neocortical output signals are weaker too, and this applies to both motor control outputs and hormone outputs. (Incidentally, I think motor control outputs are further subject to thresholding processes, downstream of the neocortex, and therefore a sufficiently weak motor command causes no motion at all.)
As I recall, much of how the brain causes you to do one thing rather than another involves suppression of signals. That is, everything in the brain is doing its thing all the time, and the way you manage to do only one thing, and not have a seizure, is that signals from various parts of the brain get suppressed so that only one is active at a time.
That's probably a bit of a loose model and doesn't exactly explain how it maps onto particular structures, but it might be interesting to look at how this sort of output-suppression theory meshes with the weaker/stronger model you're building here out of predictive processing.
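To show how the two stories could fit together, here's a deliberately loose sketch, not tied to any particular brain structure: every subsystem emits a signal all the time, a suppression step lets only the strongest through, and a downstream threshold (as in your motor-control note) drops commands too weak to cause any overt motion. All signal values and the threshold are invented for illustration.

```python
# Winner-take-all with suppression plus downstream thresholding.
def select_action(signals: dict[str, float], threshold: float = 0.3):
    """Suppress all but the strongest signal; drop it too if sub-threshold."""
    winner = max(signals, key=signals.get)
    return winner if signals[winner] >= threshold else None

# Strong prediction-driven command: survives suppression and threshold.
print(select_action({"dance": 0.9, "sit": 0.4}))   # 'dance'

# Weak memory-driven command: wins the competition but falls below the
# threshold, so no overt motion results, matching the thresholding story.
print(select_action({"dance": 0.1, "sit": 0.05}))  # None
```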
This challenge from 2018 basically asked about building a data set for training AI on human values (loosely construed so as to allow many approaches) and many of the submissions proposed ways to do it. You might find some interesting ideas there.
Caveat: I won the challenge by saying I didn't think such an approach would work.
I get worried about things like this article that showed up on the Partnership on AI blog. Reading it, there's nothing I can really object to in the body of the post: it's mostly about narrow AI alignment and promotes a positive message of targeting things that benefit society rather than narrowly maximizing a simple metric. But it's titled "Aligning AI to Human Values means Picking the Right Metrics", and that implies to me a normative claim that reads in my head as something like "to build aligned AI it is necessary and sufficient to pick the right metrics", which is something I think few would agree with. Yet if I were a casual observer just reading the title of this post, I might come away with the impression that AI alignment is as easy as optimizing for something prosocial, not that there are lots of hard problems to be solved even to get AI to do what you want, let alone to pick something beneficial to humanity to do.
To be fair this article has a standard "not necessarily the views of PAI, etc." disclaimer, but then the author is a research fellow at PAI.
This makes me a bit nervous about PAI's effect on promoting AI safety in industry, especially if it effectively downplays safety or makes it seem easier than it is, in ways that either encourage or fail to curtail risky uses of AI in industry.
Okay, so now that I've had more time to think about it, I do really like the idea of thinking of "decisions" as the subjective expression of what it feels like to learn what universe you are in, and this holds true for the third-person perspective of considering the "decisions" of others: they still go through the whole process that feels from the inside like choosing or deciding, but from the outside there is no need to appeal to this to talk about "decisions". Instead, to the outside observers, "decisions" are just resolutions of uncertainty about what will happen to a part of the universe modeled as another agent.
This seems quite elegant for my purposes, as I don't run into the problems associated with formalizing UDT (at least, not yet), and it lets me modify my model for understanding human values to push "decisions" outside of it, or into the after-the-fact part.