This is a special post for quick takes by mesaoptimizer. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.



I find Valentine's posts somewhat insightful, and believe they point at incredibly neglected research directions, but a huge distance seems to exist between what Valentine intends to communicate and what most readers seem to get.

Off the top of my head:

  • Here's the Exit is written in a style that conveys an astounding confidence that what Valentine says is applicable to the reader, no matter what. After a few commenters correctly critique the post, Valentine backs down and claims that it was meant to be an "invitation" for people who recognize themselves in the post to explore the thoughts that Valentine espouses, and not a set of claims to evaluate. This feels slightly like a bait-and-switch, and worse, I feel like Valentine was acting in complete good faith while doing so, with a sort of very out-of-distribution model of communication that they endorse for themselves.
  • In We're already in AI takeoff, Valentine seems to claim that humans should not try to intervene at the egregore level, because we are too weak to do so. When someone points out that this may not necessarily be correct, Valentine clarifies that what they meant was more that humans should not use shoulds to try to get themselves to do something they can be confident they physically cannot accomplish, and that solving AI alignment or existential risk can be one example of such a thing for many people. Again, I notice how Valentine makes a ton of sense, and points [badly] at very valuable rationalist insights and concepts when asked to clarify, but the way they pointed at them in their OP was, in my opinion, disastrously bad.
  • The only thing I recall from Kensho is Said's baking metaphor, about people who fail to explain why one should care about a meta-level thing by showing an object-level achievement accomplished using it. And yet, I get the sentiment that Valentine seems to have been trying to communicate -- it sure seems like there are epistemic rationality techniques that seem incredibly valuable and neglected, and one could discover them in the course of doing something about as useless as paperwork, and talking about how you became more efficient at paperwork would seem like a waste of time to everyone involved.

I wrote this down because I think Valentine (and other people reading this) might find it helpful, and it didn't feel like it made sense to post this as a comment on any specific individual post.

And yet, I get the sentiment that Valentine seems to have been trying to communicate—it sure seems like there are epistemic rationality techniques that seem incredibly valuable and neglected, and one could discover them in the course of doing something about as useless as paperwork, and talking about how you became more efficient at paperwork would seem like a waste of time to everyone involved.

Is this a real example or one that you’ve made up? That is, do you actually have cases in mind where someone discovered valuable and neglected epistemic rationality techniques in the course of doing paperwork?

I apologize for not providing a good enough example -- yes, it was made up. Here's a more accurate explanation of what causes me to believe that Valentine's sentiment has merit:

  • It seems to me that a lot of epistemic confusion can be traced to almost unrelated upstream misconceptions. Examples: thinking that people must be suspended upside down below the equator, once someone understands the notion of an approximately spherical Earth; the illusion that mirrors flip things horizontally but not vertically; the notion that an AGI will automatically be moral. Similarly, it seems plausible to me that while attempting to fix one issue (similar to attempting to fix a confusion of the sort just listed), one could find themselves making almost unrelated upstream epistemic discoveries that might just be significantly more valuable. I do acknowledge that these epistemic discoveries also seem object-level and communicable, and I do think that the sentiment that Valentine showed could make sense.
  • It also seems that a lot of rationality skill involves starting out with a bug one notices ("hey, I seem to be really bad at going to the gym"), and then making multiple attempts to fix the problem (ideally focusing on making an intervention as close to the 'root' of the issue as possible), and then discovering epistemic rationality techniques that may be applicable in many places. I agree that it seems like really bad strategy to then not try to explain why the technique is useful by giving another example where the technique is useful and results in good object-level outcomes, instead of simply talking about (given my original example) paperwork for a sentence and then spending paragraphs talking about some rationality technique in the abstract.

thinking that people must be suspended upside down below the equator, once someone understands the notion of an approximately spherical Earth

That page seems to be talking about a four-year-old child, who has not yet learned about space, how gravity works, etc. It’s not clear to me that there’s anything to conclude from this about what sorts of epistemic rationality techniques might be useful to adults.

More importantly, it’s not clear to me how any of your examples are supposed to be examples of “epistemic confusion [that] can be traced to almost unrelated upstream misconceptions”. Could you perhaps make the connection more explicitly?

Similarly, it seems plausible to me that while attempting to fix one issue (similar to attempting to fix a confusion of the sort just listed), one could find themselves making almost unrelated upstream epistemic discoveries that might just be significantly more valuable.

And… do you have any examples of this?

It also seems that a lot of rationality skill involves starting out with a bug one notices (“hey, I seem to be really bad at going to the gym”), and then making multiple attempts to fix the problem (ideally focusing on making an intervention as close to the ‘root’ of the issue as possible), and then discovering epistemic rationality techniques that may be applicable in many places.

There’s a lot of “<whatever> seems like it could be true” in your comment. Are you really basing your views on this subject on nothing more than abstract intuition?

I agree that it seems like really bad strategy to then not try to explain why the technique is useful by giving another example where the technique is useful and results in good object-level outcomes, instead of simply talking about (given my original example) paperwork for a sentence and then spending paragraphs talking about some rationality technique in the abstract.

If, hypothetically, you discovered some alleged epistemic rationality technique while doing paperwork, I would certainly want you to either explain how you applied this technique originally (with a worked example involving your paperwork), or explain how the reader might (or how you did) apply the technique to some other domain (with a worked example involving something else, not paperwork), or (even better!) both.

It would be very silly to just talk about the alleged technique, with no demonstration of its purported utility.

If, hypothetically, you discovered some alleged epistemic rationality technique while doing paperwork, I would certainly want you to either explain how you applied this technique originally (with a worked example involving your paperwork), or explain how the reader might (or how you did) apply the technique to some other domain (with a worked example involving something else, not paperwork), or (even better!) both.

This seems sensible, yes.

It would be very silly to just talk about the alleged technique, with no demonstration of its purported utility.

I agree that it seems silly to not demonstrate the utility of a technique when trying to discuss it! I try to give examples to support my reasoning when possible. What I attempted to do with that one passage that you seem to have taken offense to was to show that I could guess at one causal cognitive chain that would have led Valentine to feel the way they did and therefore act and communicate the way they did -- not that I endorse the way Kensho was written, because I did not get anything out of the original post.

There’s a lot of “<whatever> seems like it could be true” in your comment.

Here's a low-investment attempt to point at the cause of what seems to you to be a verbal tic:

I can tell you that when I put “it seems to me” at the front of so many of my sentences, it’s not false humility, or insecurity, or a verbal tic. (It’s a deliberate reflection on the distance between what exists in reality, and the constellations I’ve sketched on my map.)

-- Logan Strohl, Naturalism

If you need me to write up a concrete elaboration to help you get a better idea about this, please tell me.

Are you really basing your views on this subject on nothing more than abstract intuition?

My intuitions about my claim related to rationality skill seem to be informed by concrete personal experience, which I haven't yet described at length, mainly because I expected that a simple, plausible, made-up example would serve just as well. I apologize for not adding a "(based on experience)" in that original passage, although I guess I assumed that was deducible.

That page seems to be talking about a four-year-old child, who has not yet learned about space, how gravity works, etc. It’s not clear to me that there’s anything to conclude from this about what sorts of epistemic rationality techniques might be useful to adults.

I'm specifically pointing at examples of deconfusion here, which I consider the main (and probably the only?) strand of epistemic rationality techniques. I concede that I haven't provided you useful information about how to do it -- but that isn't something I'd like to get into right now, when I am still wrapping my mind around deconfusion.

More importantly, it’s not clear to me how any of your examples are supposed to be examples of “epistemic confusion [that] can be traced to almost unrelated upstream misconceptions”. Could you perhaps make the connection more explicitly?

For the gravity example, the 'upstream misconception' is that the kid did not realize that 'up and down' is relative to the direction in which Earth's gravity acts on the body, and therefore the kid tries to fit the square peg of "Okay, I see that humans have heads that point up and legs that point down" into the round hole of "Below the equator, humans are pulled upward, and humans' heads are up, so humans' heads point to the ground".

For the AI example, the 'upstream misconception' can be[1] conflating the notion of intelligence with 'human behaviors and tendencies that I recognize as intelligence' (and this in turn can be due to other misconceptions, such as not understanding how alien the selection process that underlies evolution is; not understanding that intelligence is not the same as saying impressive things at a social party but rather the ability to squeeze the probability distribution of future outcomes into a smaller space; et cetera), and then making a reasoning error that amounts to anthropomorphizing an AI, and concluding that the more intelligent a system is, the more it would care about the 'right things' that we humans seem to care about.

The second example is a bit expensive to elaborate on, so I will not do so right now. I apologize.

Anyway, I intended to write this stuff up when I felt like I understood deconfusion enough that I could explain it to other people.

Similarly, it seems plausible to me that while attempting to fix one issue (similar to attempting to fix a confusion of the sort just listed), one could find themselves making almost unrelated upstream epistemic discoveries that might just be significantly more valuable.

And… do you have any examples of this?

I find this plausible based on my experience with deconfusion and my current state of understanding of the skill. I do not believe I understand deconfusion well enough to communicate it across an inferential distance as huge as the one between you and me, so I do not intend to try.

[1]: There are a myriad of ways you can be confused, and only one way you can be deconfused.

I notice that Joe Carlsmith dropped a 127-page paper on the question of deceptive alignment. I am confused; who is the intended audience of this paper?

AFAICT nobody would actually read all 127 pages of the report, and most potential reasons for writing it seem, to me, better served by faster feedback loops and significantly smaller research artifacts.

What am I missing?

Often I write big boring posts so I can refer to my results in shorter, more readable posts later on. That way, if anyone cares and questions my result, they can see the full argument without impairing the readability of the focal post.

My model is that a text like this is often, in substantial part, an artifact of the author's personal understanding. But also, my model of Open Phil employees is that they totally read 100-page documents all the time.

I've noticed that there are two major "strategies of caring" used in our sphere:

  • Soares-style caring, where you override your gut feelings (your "internal care-o-meter" as Soares puts it) and use cold calculation to decide.
  • Carlsmith-style caring, where you do your best to align your gut feelings with the knowledge of the pain and suffering the world is filled with, including the suffering you cause.

Nate Soares obviously endorses staring unflinchingly into the abyss that is reality (if you are capable of doing so). However, I expect that almost-pure Soares-style caring (which in essence amounts to "shut up and multiply", and consequentialism) combined with inattention or an inaccurate map of the world (aka broken epistemics) can lead to making severely sub-optimal decisions.

The harder you optimize for a goal, the better your epistemology (and by extension, your understanding of your goal and the world) needs to be. Carlsmith-style caring seems more effective since it is very likely more robust to bad epistemology than Soares-style caring is.

(There are more pieces necessary to make Carlsmith-style caring viable, and a lot of them can be found in Soares' "Replacing Guilt" series.)

Does this come from a general idea of "optimizing hard" means higher risk of damage caused by errors in detail, and "optimizing soft" has enough slack so as not to have the same risks, but also soft is less ambitious and likely less effective (if both are actually implemented well)?

a general idea of “optimizing hard” means higher risk of damage caused by errors in detail

Agreed.

“optimizing soft” has enough slack so as not to have the same risks, but also soft is less ambitious and likely less effective

I disagree with the idea that "optimizing soft" is less ambitious. "Optimizing soft", in my head, is about as ambitious as "optimizing hard", except it makes the epistemic uncertainty more explicit. In this model of caring that I am trying to make more legible, I believe that Carlsmith-style caring may be more robust to certain epistemological errors that can result in severely sub-optimal outcomes, because it is constrained by human cognition and capabilities.

Note: I notice that this can also be said for Soares-style caring -- both are constrained by human cognition and capabilities, but in different ways. Perhaps both have different failure modes, and are more effective in certain distributions (which may diverge)?

Backing up a step, because I'm pretty sure we have different levels of knowledge and assumptions (mostly my failing) about the differences between "hard" and "soft" optimizing.

I should acknowledge that I'm not particularly invested in EA as a community or identity. I try to be effective, and do some good, but I'm exploring rather than advocating here. 

Also, I don't tend to frame things as "how to care", so much as "how to model the effects of actions, and how to use those models to choose how to act".  I suspect that's isomorphic to how you're using "how to care", but I'm not sure of that.

All that said, I think of "optimizing hard" as truly taking seriously the "shut up and multiply" results, even where it's uncomfortable epistemically, BECAUSE that's the only way to actually do the MOST POSSIBLE good.  actually OPTIMIZING, you know?  "soft" is almost by definition less ambitious, BECAUSE it's epistemically more conservative, and gives up average expected value in order to increase modal goodness in the face of that uncertainty.  I don't actually know if those are the positions taken by those people.  I'd love to hear different definitions of "hard" and "soft", so I can better understand why they're both equal in impact.

I predict this is not really an accurate representation of Soares-style caring. (I think there is probably some vibe difference between these two clusters that you're tracking, but I doubt Nate Soares would advocate "overriding" per se)

I doubt Nate Soares would advocate “overriding” per se

Acknowledged, that was an unfair characterization of Nate-style caring. I guess I wanted to make explicit two extremes. Perhaps using the name "Nate-style caring" is a bad idea.

(I now think that "System 1 caring" and "System 2 caring" would have been much better.)

2022-08; Jan Leike, John Schulman, Jeffrey Wu; Our Approach to Alignment Research

OpenAI's strategy, as of the publication of that post, involves scalable alignment approaches. Their philosophy is to take an empirical and iterative approach[1] to finding solutions to the alignment problem. Their strategy for alignment is cyborgism: they create AI models that are capable and aligned enough to advance alignment research to the point where those models can help align even more capable models.[2]

Their research focus is on scalable approaches to directing models[3]. This means that the core of their strategy involves RLHF. They don't expect RLHF to be sufficient on its own, but they consider it necessary for the other scalable alignment strategies they are looking at[4].

They intend to augment RLHF with AI-assisted, scaled-up evaluation (ensuring RLHF isn't bottlenecked by a lack of accurate evaluation data for tasks too onerous for unassisted humans to evaluate)[5].
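
To make this concrete, here is a minimal toy sketch of the "AI-assisted evaluation feeding RLHF" loop as I understand it from the post. Everything here is a placeholder of my own (the function names and random stand-ins are not OpenAI's code); the point is just the shape of the pipeline: an assistant model critiques candidate responses, a human uses the critiques to pick a winner, and the resulting preference pairs are what a reward model would be trained on.

```python
import random

def assistant_critique(task: str, response: str) -> str:
    """Stand-in for an AI evaluator that surfaces flaws a human might miss."""
    return f"Possible flaw in response to '{task}': check the claim in '{response[:20]}...'"

def human_preference(task: str, a: str, b: str, critiques: tuple) -> str:
    """Stand-in for the human judge (here it just picks at random)."""
    return random.choice([a, b])

def collect_preference_pairs(tasks, policy_sample):
    """Produce (task, preferred, rejected) triples -- the reward model's training data."""
    pairs = []
    for task in tasks:
        a, b = policy_sample(task), policy_sample(task)
        critiques = (assistant_critique(task, a), assistant_critique(task, b))
        preferred = human_preference(task, a, b, critiques)
        rejected = b if preferred == a else a
        pairs.append((task, preferred, rejected))
    return pairs

if __name__ == "__main__":
    sample = lambda task: f"draft-{random.randint(0, 9)} for {task}"
    print(collect_preference_pairs(["summarize paper X", "review codebase Y"], sample))
```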

Finally, they intend to use these partially-aligned models to do alignment research, since they anticipate that alignment approaches that work and are viable for low-capability models may not be adequate for models with higher capabilities.[6] They intend to use the AI-based evaluation tools both to RLHF-align models and as part of a process where humans evaluate alignment research produced by these LLMs (here's the cyborgism part of the strategy).[7]

The "Limitations" section of their blog post clearly points out the vulnerabilities in their strategy:

  • Their strategies involve using one black box (scalable evaluation models) to align another black box (large LLMs being RLHF-aligned), a strategy I am pessimistic about, although it is probably good enough for low enough capability models
  • They ignore non-Godzilla strategies such as interpretability research and robustness (that is, robustness to distribution shift and adversarial attacks -- see Stephen Casper's research for an idea about this), although they do intend to hire researchers so that their portfolio includes investment in this research direction
  • They may be wrong about being able to create AI models that are partially aligned and help with alignment research but aren't so capable that they can perform pivotal acts. If so, then any pivotal acts achieved will only be partially aligned with the intentions of the AI's wielder, and will probably not lead to a good ending.

  1. We take an iterative, empirical approach: by attempting to align highly capable AI systems, we can learn what works and what doesn’t, thus refining our ability to make AI systems safer and more aligned.

  2. We believe that even without fundamentally new alignment ideas, we can likely build sufficiently aligned AI systems to substantially advance alignment research itself.

  3. At a high-level, our approach to alignment research focuses on engineering a scalable training signal for very smart AI systems that is aligned with human intent.

  4. We don’t expect RL from human feedback to be sufficient to align AGI, but it is a core building block for the scalable alignment proposals that we’re most excited about, and so it’s valuable to perfect this methodology.

  5. RL from human feedback has a fundamental limitation: it assumes that humans can accurately evaluate the tasks our AI systems are doing. Today humans are pretty good at this, but as models become more capable, they will be able to do tasks that are much harder for humans to evaluate (e.g. finding all the flaws in a large codebase or a scientific paper). Our models might learn to tell our human evaluators what they want to hear instead of telling them the truth.

  6. There is currently no known indefinitely scalable solution to the alignment problem. As AI progress continues, we expect to encounter a number of new alignment problems that we don’t observe yet in current systems. Some of these problems we anticipate now and some of them will be entirely new.

    We believe that finding an indefinitely scalable solution is likely very difficult. Instead, we aim for a more pragmatic approach: building and aligning a system that can make faster and better alignment research progress than humans can.

  7. We believe that evaluating alignment research is substantially easier than producing it, especially when provided with evaluation assistance. Therefore human researchers will focus more and more of their effort on reviewing alignment research done by AI systems instead of generating this research by themselves. Our goal is to train models to be so aligned that we can off-load almost all of the cognitive labor required for alignment research.


Sidenote: I like how OpenAI ends their blog posts with an advertisement for positions they are hiring for, or programs they are running. That's a great strategy to advertise to the very people they want to reach.

I recently had to solve a captcha to submit a reddit post using a new reddit account I made (because I did not use reddit until now). It was an extremely Kafkaesque experience: I tried the captcha in good faith, and Google repeatedly told me I had done it incorrectly, without explaining why. This went on for multiple minutes; I kept being told I was doing it wrong, even though I kept clicking on all the right boxes that contained parts of a bicycle or a motorcycle or whatever. The slow fade-in and fade-out images were the worst, and being made to do this for extended periods of time is, I think, a form of low-level torture.

I admit that I use a highly unusual browser setup: portrait mode, OpenBSD amd64 OS, Mozilla Firefox with uBlock Origin, and an external keyboard where I use the arrow keys to control the mouse most of the time. I expect that such an out-of-distribution setup may have led the captcha AI to be suspicious of me. All of this was intended to improve my experience of using my machine and interfacing with the Internet. Worse, I was already signed into my Google account, so it didn't make sense that Google would still suspect me of being a bot.

I've decided on a systemic solution for this problem:

  1. I shall never manually complete a Captcha again.
  2. I shall use automated captcha-solvers, or failing that, pay other people to solve captchas for me, if I really must bypass a captcha.

One could interpret this as adversarial action against Google and Reddit, but it seems to me that when dealing with an optimizer that is taking constant adversarial action against you, and is credibly unwilling to attempt to co-operate and solve the problem you both face, the next step is to defect. Ideally you extricate yourself from the situation, but in some cases that isn't acceptable given your goals.

I expect that people who are paid to solve captchas probably are numb to this, or have been trained by the system to solve captchas more efficiently, such that they may be optimized for dealing with its Kafkaesque nature. I do not expect to feel like I would be putting them through the pain I would have experienced. I still do not consider it an ideal state of affairs, though.

Yeah, my understanding of how bot detection on lots of these sites works is that they track your mouse, then run a simple classifier on the mouse movements to differentiate between bots and humans. So it's no surprise that moving your mouse with your arrow keys would make the classifier very suspicious.
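
For illustration only (I have no idea what Google actually uses), here is the kind of crude heuristic that framing suggests: extract simple kinematic features from the recorded cursor trajectory and flag trajectories that look too uniform. Arrow-key-driven cursor movement in fixed increments would plausibly trip exactly this sort of check.

```python
def movement_features(points):
    """points: list of (x, y, t) samples. Returns mean speed and speed variance."""
    if len(points) < 2:
        return 0.0, 0.0
    speeds = []
    for (x0, y0, t0), (x1, y1, t1) in zip(points, points[1:]):
        dt = max(t1 - t0, 1e-6)
        speeds.append(((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5 / dt)
    mean = sum(speeds) / len(speeds)
    variance = sum((s - mean) ** 2 for s in speeds) / len(speeds)
    return mean, variance

def looks_like_a_bot(points, variance_threshold=1.0):
    """Human mouse motion is jittery; near-constant speed has suspiciously low variance."""
    _, variance = movement_features(points)
    return variance < variance_threshold
```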

Just a quote I find rather interesting, since it is rare to see a Hero's Journey narrative with a Return in which the hero does not know if he will ever belong or find meaning once he returns, and yet chooses to return, having faith in his ability to find meaning again:

If every living organism has a fixed purpose for its existence, then one thing's for sure. I [...] have completed my mission. I've fulfilled my purpose. But a great amount of power that has served its purpose is a pain to deal with, just like nuclear materials that have reached the end of their lifespan. If that's the case, there'll be a lot of questions. Would I now become an existence that this place doesn't need anymore?
The time will come when the question of whether it's okay for me to remain in this place will be answered.
However...
If there's a reason to remain in this place, then it's probably that there are still people that I love in this place.
And that people who love me are still here.
Which is why that's enough reason for me to stay here.
I'll stay here and find other reasons as to why I should stay here...
That's what I've decided on.

-- The final chapter of Solo Leveling

Thoughts on Tom Everitt, Ramana Kumar, Victoria Krakovna, Shane Legg; 2019; Modeling AGI Safety Frameworks with Causal Influence Diagrams:

Causal influence diagrams are interesting, but don't really seem all that useful. Anyway, the latest formal graphical representation for agents that the authors seem to promote is structural causal models, so you don't read this paper for its object-level usefulness but for its incidental research contributions, which are really interesting.

The paper divides AI systems into two major frameworks:

  • MDP-based frameworks (aka RL-based systems such as AlphaZero), which involve AI systems that take actions and are assigned a reward for their actions
  • Question-answering systems (which include all supervised learning systems, including sequence modellers like GPT), where the system gives an output for a given input and is scored based on a label of the same data type as the output. This is also informally known as tool AI (they cite Gwern's post, which is nice to see).

I liked how lucidly they defined wireheading:

In the basic MDP from Figure 1, the reward parameter Θ_R is assumed to be unchanging. In reality, this assumption may fail because the reward function is computed by some physical system that is a modifiable part of the state of the world. [...] This gives an incentive for the agent to obtain more reward by influencing the reward function rather than optimizing the state, sometimes called wireheading.

The common definition of wireheading is informal enough that different people would map it to different specific formalizations in their head (or perhaps have no formalization and therefore be confused), and having this 'more formal' definition in my head seems rather useful.

Here's their distillation of current-RF optimization, a strategy to avoid wireheading (which reminds me of shard theory, now that I think about it -- models that avoid wireheading by modelling the effects of changes to their policy and then deciding what trajectory of actions to take):

An elegant solution to this problem is to use model-based agents that simulate the state sequence likely to result from different policies, and evaluate those state sequences according to the current or initial reward function.
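
To make the distinction concrete, here is a toy sketch of my own (not from the paper): the world state includes a physical reward function, "work" improves the actual state, and "tamper" overwrites the reward function. A planner that scores simulated outcomes with whatever reward function ends up in the outcome prefers tampering; current-RF optimization scores outcomes with the current reward function, and so prefers working.

```python
def true_reward(resources):
    return resources  # the intended reward: more resources is better

def step(state, action):
    resources, reward_fn = state
    if action == "work":
        return (resources + 1, reward_fn)
    if action == "tamper":
        return (resources, lambda r: 1e9)  # overwrite the physical reward function
    return state

def plan(state, actions, use_current_rf):
    _, current_fn = state
    def score(action):
        resources, outcome_fn = step(state, action)
        fn = current_fn if use_current_rf else outcome_fn
        return fn(resources)
    return max(actions, key=score)

start = (0, true_reward)
print(plan(start, ["work", "tamper"], use_current_rf=False))  # "tamper" -- wireheading
print(plan(start, ["work", "tamper"], use_current_rf=True))   # "work"
```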

Here's their distillation of Reward Modelling:

A key challenge when scaling RL to environments beyond board games or computer games is that it is hard to define good reward functions. Reward Modeling [Leike et al., 2018] is a safety framework in which the agent learns a reward model from human feedback while interacting with the environment. The feedback could be in the form of preferences, demonstrations, real-valued rewards, or reward sketches. [...] Reward modeling can also be done recursively, using previously trained agents to help with the training of more powerful agents [Leike et al., 2018].

The resulting CI diagram modelling actually made me feel like I grokked Reward Modelling better.

Here's their distillation of CIRL:

Another way for agents to learn the reward function while interacting with the environment is Cooperative Inverse Reinforcement Learning (CIRL) [Hadfield-Menell et al., 2016]. Here the agent and the human inhabit a joint environment. The human and the agent jointly optimize the sum of rewards, but only the human knows what the rewards are. The agent has to infer the rewards by looking at the human’s actions.

The difference between the RM and CIRL causal influence diagrams is subtle but interesting. The authors imply that this minor difference matters and can imply different things about system incentives, and therefore about safety guarantees; I am enthusiastic about such strategies for investigating safety guarantees.

The authors describe a wireheading-equivalent for QA systems called self-fulfilling prophecies:

The assumption that the labels are generated independently of the agent’s answer sometimes fails to hold. For example, the label for an online stock price prediction system could be produced after trades have been made based on its prediction. In this case, the QA-system has an incentive to make self-fulfilling prophecies. For example, it may predict that the stock will have zero value in a week. If sufficiently trusted, this prediction may lead the company behind the stock to quickly go bankrupt. Since the answer turned out to be accurate, the QA-system would get full reward. This problematic incentive is represented in the diagram in Figure 9, where we can see that the QA-system has both incentive and ability to affect the world state with its answer [Everitt et al., 2019].

They propose a solution to the self-fulfilling prophecies problem: make oracles optimize for reward in the counterfactual world where their answer doesn't influence the world state, and therefore doesn't influence the label they are optimized against. While that is a solution, I am unsure how one can get counterfactual labels for complicated questions whose answers may have far-reaching consequences in the world.

It is possible to fix the incentive for making self-fulfilling prophecies while retaining the possibility to ask questions where the correctness of the answer depends on the resulting state. Counterfactual oracles optimize reward in the counterfactual world where no one reads the answer [Armstrong, 2017]. This solution can be represented with a twin network [Balke and Pearl, 1994] influence diagram, as shown in Figure 10. Here, we can see that the QA-system’s incentive to influence the (actual) world state has vanished, since the actual world state does not influence the QA-system’s reward; thereby the incentive to make self-fulfilling prophecies also vanishes. We expect this type of solution to be applicable to incentive problems in many other contexts as well.
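
Here is a toy sketch of that training setup as I understand it (my own construction; all the names are placeholders): the oracle's answer is only scored on randomly chosen "erasure" episodes where the answer is withheld from everyone, so the outcome it is scored against can never have been influenced by the answer.

```python
import random

def score(answer, outcome):
    return 1.0 if answer == outcome else 0.0

def run_episode(oracle, question, world, erasure_prob=0.1):
    answer = oracle(question)
    if random.random() < erasure_prob:
        # Erasure: nobody sees the answer, the world evolves uninfluenced,
        # and only these episodes produce a training signal for the oracle.
        outcome = world(question, answer_shown=None)
        return ("train", score(answer, outcome))
    # Otherwise the answer is shown and acted upon, but yields no training signal.
    world(question, answer_shown=answer)
    return ("no_train", None)

if __name__ == "__main__":
    oracle = lambda q: "up"
    world = lambda q, answer_shown: random.choice(["up", "down"])
    print(run_episode(oracle, "will the stock go up next week?", world))
```

In this toy the counterfactual outcome is cheap to observe; for questions whose answers have far-reaching consequences, producing that withheld-answer outcome is exactly the part I don't see how to do tractably.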

The authors also anticipate this problem, but instead of considering whether and how one can tractably calculate counterfactual labels, they use this intractability to introduce the debate AI safety strategy:

To fix this, Irving et al. [2018] suggest pitting two QA-systems against each other in a debate about the best course of action. The systems both make their own proposals, and can subsequently make arguments about why their own suggestion is better than their opponent’s. The system who manages to convince the user gets rewarded; the other system does not. While there is no guarantee that the winning answer is correct, the setup provides the user with a powerful way to poke holes in any suggested answer, and reward can be dispensed without waiting to see the actual result.

I like how they explicitly mention that there is no guarantee that the winning answer is correct, which makes me more enthusiastic about considering debate as a potential strategy.

They also have an incredibly lucid distillation of IDA. Seriously, this is significantly better than all the Paul Christiano posts I've read and the informal conversations I've had about IDA:

Iterated distillation and amplification (IDA) [Christiano et al., 2018] is another suggestion that can be used for training QA-systems to correctly answer questions where it is hard for an unaided user to directly determine their correctness. Given an original question Q that is hard to answer correctly, less powerful systems X_k are asked to answer a set of simpler questions Q_i. By combining the answers A_i to the simpler questions Q_i, the user can guess the answer Â to Q. A more powerful system X_{k+1} is trained to answer Q, with Â used as an approximation of the correct answer to Q.

Once the more powerful system X_{k+1} has been trained, the process can be repeated. Now an even more powerful QA-system X_{k+2} can be trained, by using X_{k+1} to answer simpler questions to provide approximate answers for training X_{k+2}. Systems may also be trained to find good subquestions, and for aggregating answers to subquestions into answer approximations. In addition to supervised learning, IDA can also be applied to reinforcement learning.
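
Here is a minimal sketch of that loop in code, as I read the quote (my own toy; decompose(), aggregate(), and the lookup-table "training" are placeholders for the genuinely hard parts):

```python
def decompose(question):
    """Split a hard question into simpler subquestions (placeholder)."""
    return [f"{question} -- part {i}" for i in range(3)]

def aggregate(question, sub_answers):
    """Combine subquestion answers into an approximate answer (placeholder)."""
    return " / ".join(sub_answers)

def train(dataset):
    """Distill (question, approximate answer) pairs into a 'more powerful' system.
    A lookup table stands in for supervised training here."""
    table = dict(dataset)
    return lambda q: table.get(q, "unknown")

def ida_step(weak_system, hard_questions):
    dataset = []
    for q in hard_questions:
        sub_answers = [weak_system(sub_q) for sub_q in decompose(q)]
        approx_answer = aggregate(q, sub_answers)  # the Â used as a training target
        dataset.append((q, approx_answer))
    return train(dataset)  # the distilled system X_{k+1}

# Each distilled system becomes the assistant for the next, stronger round.
system = lambda q: f"best guess at '{q}'"
for _ in range(2):
    system = ida_step(system, ["how do we align AGI?"])
print(system("how do we align AGI?"))
```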

I have no idea why they included Drexler's CAIS -- but it is better than reading 300 pages of the original paper:

Drexler [2019] argues that the main safety concern from artificial intelligence does not come from a single agent, but rather from big collections of AI services. For example, one service may provide a world model, another provide planning ability, a third decision making, and so on. As an aggregate, these services can be very competent, even though each service only has access to a limited amount of resources and only optimizes a short-term goal.

The authors claim that the AI safety issues commonly discussed can be derived 'downstream' of modelling these systems more formally, using these causal influence diagrams. I disagree, due to the number of degrees of freedom the modeller is given when making these diagrams.

In the discussion section, the authors talk about the assumptions underlying the representations, and their limitations. They explicitly point out how the intentional stance may be limiting and may fail to model certain classes of AI systems or agents (hint: read their newer papers!).

Overall, the paper was an easy and fun read, and I loved the distillations of AI safety approaches in them. I'm excited to read papers by this group.

I want to differentiate between categories of capabilities improvement in AI systems, and here's the set of terms I've come up with to think about them:

  • Infrastructure improvements: Capability boost in the infrastructure that makes up an AI system. This involves software (PyTorch, CUDA), hardware (NVIDIA GPUs), operating systems, networking, and the physical environment where the infrastructure is situated. This is probably not the lowest-hanging fruit when it comes to capabilities acceleration.

  • Scaffolding improvements: Capability boost in an AI system that involves augmenting the AI system via software features. Think of it as keeping the CPU of the natural language computer the same, but upgrading its RAM and SSD and IO devices. Some examples off the top of my head: hyperparameter optimization for generating text, use of plugins, embeddings for memory. More information is in beren's essay linked in this paragraph.

  • Neural network improvements: Any capability boost in an AI system that specifically involves improving the black-box neural network that drives the system. This is mainly what SOTA ML researchers focus on, and is what has driven the AI hype over the past decade. This can involve architectural improvements, training improvements, finetuning afterwards (RLHF to me counts as capabilities acceleration via neural network improvements), etc.

There probably are more categories, or finer ways to slice the space of capability acceleration mechanisms, but I haven't thought about this in as much detail yet.

As far as I can tell, both capabilities augmentation and capabilities acceleration contribute to achieving recursive self-improving (RSI) systems, and once you hit that point, foom is inevitable.

Alignment agendas can generally be classified into two categories: blueprint-driven and component-driven. Understanding this distinction is probably valuable for evaluating and comprehending different agendas.

Blueprint-driven alignment agendas are approaches that start with a coherent blueprint for solving the alignment problem. They prioritize the overall structure and goals of the solution before searching for individual components or building blocks that fit within that blueprint. Examples of blueprint-driven agendas include MIRI's agent foundations, Vanessa Kosoy and Diffractor's Infrabayesianism, and carado's formal alignment agenda. Research aimed at developing a more accurate blueprint, such as Nate Soares' 2022-now posts, Adam Shimi's epistemology-focused output, and John Wentworth's deconfusion-style output, also fall into this category.

Component-driven alignment agendas, on the other hand, begin with available components and seek to develop new pieces that work well with existing ones. They focus on making incremental progress by developing new components that can be feasibly implemented and integrated with existing AI systems or techniques to address the alignment problem. OpenAI's strategy, DeepMind's strategy, Conjecture's LLM-focused outputs, and Anthropic's strategy are examples of this approach. Agendas that serve as temporary solutions by providing useful components that integrate with existing ones, such as ARC's power-seeking evals, also fall under the component-driven category. Additionally, the Cyborgism agenda and the Accelerating Alignment agenda can be considered component-driven.

The blueprint-driven and component-driven categorization seems to me to be more informative than dividing agendas into conceptual and empirical categories. This is because all viable alignment agendas require a combination of conceptual and empirical research. Categorizing agendas based on the superficial pattern of their current research phase can be misleading. For instance, shard theory may initially appear to be a blueprint-driven conceptual agenda, like embedded agency. However, it is actually a component-driven agenda, as it involves developing pieces that fit with existing components.

Given the significant limitations of using a classifier to detect AI-generated text, it seems strange to me that OpenAI went ahead, built one, and released it for the public to try. As far as I can tell, this is OpenAI aggressively acting to cover its bases against potential legal and PR damages due to ChatGPT's existence.

For me this is slight positive evidence for the idea that AI governance may actually be useful in extending timelines, but only if it involves adversarial actions that act on the vulnerabilities of these companies. But even then, that seems like a myopic decision given the existence of other, less controllable actors (like China) racing as fast as possible towards AGI.

Jan Hendrik Kirchner now works at OpenAI, it seems, given that he is listed as the author of this blog post. I don't see this listed on his profile or on his substack or twitter account, so this is news to me.