All of capybaralet's Comments + Replies

Causal confusion as an argument against the scaling hypothesis

I think you're moving the goal-posts, since before you mentioned "without external calculators".  I think external tools are likely to be critical to doing this, and I'm much more optimistic about that path to doing this kind of robust generalization.  I don't think that necessarily addresses concerns about how the system reasons internally, though, which still seems likely to be critical for alignment.

Causal confusion as an argument against the scaling hypothesis

I disagree; I think we have intuitive theories of causality (like intuitive physics) that are very helpful for human learning and intelligence.  

Causal confusion as an argument against the scaling hypothesis

RE GPT-3, etc. doing well on math problems: the key word in my response was "robustly".  I think there is a big qualitative difference between "doing a good job on a certain distribution of math problems" and "doing math (robustly)".  This could be obscured by the fact that people also make mathematical errors sometimes, but I think the type of errors is importantly different from those made by DNNs.
 

2Owain Evans15d
This is a distribution of math problems GPT-3 wasn't finetuned on. Yet it's able to few-shot generalize and perform well. This is an amazing level of robustness relative to 2018 deep learning systems. I don't see why scaling and access to external tools (e.g. to perform long calculations) wouldn't produce the kind of robustness you have in mind.
Causal confusion as an argument against the scaling hypothesis

Are you aware of any examples of the opposite happening?  I guess it should for some tasks.

Causal confusion as an argument against the scaling hypothesis

I can interpret your argument as being only about the behavior of the system, in which case:
- I agree that models are likely to learn to imitate human dialogue about causality, and this will require some amount of some form of causal reasoning.
- I'm somewhat skeptical that models will actually be able to robustly learn these kinds of abstractions with a reasonable amount of scaling, but it certainly seems highly plausible.

I can also interpret your argument as being about the internal reasoning of the system, in which case:
- I put this in the "deep learning... (read more)

3Owain Evans15d
GPT-3 (without external calculators) can do very well on math word problems (https://arxiv.org/abs/2206.02336) that combine basic facts about the world with abstract math reasoning. Why think that the kind of causal reasoning humans do is out of reach of scaling (especially if you allow external calculators)? It doesn't seem different in kind from these math word problems. Some human causal reasoning is explicit. Humans can't do complex and exact calculations using System 1 intuition, and neither can we do causal reasoning of any sophistication using System 1. The prior over causal relations (e.g. that without looking at any data 'smoking causes cancer' is way more likely than the reverse) is more about general world-model building, and maybe there's more uncertainty about how well scaling learns that.
0Noosphere8915d
The important part of his argument is in the second paragraph, and I agree: by and large, pretty much everything we (and, at least at first, an AI) know about science and causality rests on trusting scientific papers and experts. Virtually no knowledge comes from direct experimentation; it comes instead from trusting papers, experts, and books.
AGI Ruin: A List of Lethalities

I basically agree.
 

I am arguing against extreme levels of pessimism (~>99% doom).  

AGI Ruin: A List of Lethalities

While I share a large degree of pessimism for similar reasons, I am somewhat more optimistic overall.  

Most of this comes from generic uncertainty and epistemic humility; I'm a big fan of the inside view, but it's worth noting that this can (roughly) be read as a set of 42 statements that need to be true for us to in fact be doomed, and statistically speaking it seems unlikely that all of these statements are true.
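(As a rough illustration of the statistical point, with a made-up per-claim probability and an independence assumption that certainly doesn't hold exactly: even if each individual claim were 95% likely to be true, the full conjunction of 42 claims would hold with probability only about 0.95^42 ≈ 0.12.)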

However, there are some more specific points I can point to where I think you are overconfident, or at least not providing good reasons for... (read more)

2Rob Bensinger1mo
I don't think these statements all need to be true in order for p(doom) to be high, and I also don't think they're independent. Indeed, they seem more disjunctive than conjunctive to me; there are many cases where any one of the claims being true increases risk substantially, even if many others are false.
[$20K in Prizes] AI Safety Arguments Competition

What about graphics?  e.g. https://twitter.com/DavidSKrueger/status/1520782213175992320

Intuitions about solving hard problems

This is Eliezer’s description of the core insight behind Paul’s imitative amplification proposal. I find this somewhat compelling, but less so than I used to, since I’ve realized that the line between imitation learning and reinforcement learning is blurrier than I used to think (e.g. see this or this).

 

I didn't understand what you mean by the line being blurrier... Is this a comment about what works in practice for imitation learning?  Does a similar objection apply if we replace imitation learning with behavioral cloning?

Intuitions about solving hard problems

Weight-sharing makes deception much harder.

Can I read about that somewhere?  Or could you briefly elaborate?

Takeoff speeds have a huge effect on what it means to work on AI x-risk

It's possible that a lot of our disagreement is due to different definitions of "research on alignment", where you would only count things that (e.g.) 1) are specifically about alignment that likely scales to superintelligent systems, or 2) are motivated by X safety.

To push back on that a little bit...
RE (1): It's not obvious what will scale, and I think historically this community has been too pessimistic (i.e. almost completely dismissive) about approaches that seem hacky or heuristic.
RE (2): This is basically circular.

5Adam Shimi3mo
I disagree, so I'm curious about what are great examples for you of good research on alignment that is not done by x-risk motivated people? (Not being dismissive, I'm genuinely curious, and discussing specifics sounds more promising than downvoting you to oblivion and not having a conversation at all).
Takeoff speeds have a huge effect on what it means to work on AI x-risk

In particular, in a fast takeoff world, AI takeover risk never looks much more obvious than it does now, and so x-risk-motivated people should be assumed to cause the majority of the research on alignment that happens.

I strongly disagree with that and I don't think it follows from the premise.  I think by most reasonable definitions of alignment it is already the case that most of the research is not done by x-risk motivated people. 

Furthermore, I think it reflects poorly on this community that this sort of sentiment seems to be common. 

3David Krueger3mo
It's possible that a lot of our disagreement is due to different definitions of "research on alignment", where you would only count things that (e.g.) 1) are specifically about alignment that likely scales to superintelligent systems, or 2) are motivated by X safety. To push back on that a little bit... RE (1): It's not obvious what will scale, and I think historically this community has been too pessimistic (i.e. almost completely dismissive) about approaches that seem hacky or heuristic. RE (2): This is basically circular.
Discussion with Eliezer Yudkowsky on AGI interventions

I'm familiar with these claims, and (I believe) the principal supporting arguments that have been made publicly.  I think I understand them reasonably well.

I don't find them decisive.  Some aren't even particularly convincing.  A few points:

- EY sets up a false dichotomy between "train in safe regimes" and "train in dangerous regimes". In the approaches under discussion there is an ongoing effort (e.g. involving some form(s) of training) to align the system, and the proposal is to keep this effort ahead of advances in capability (in some se... (read more)

1Ramana Kumar7mo
This sounds confused to me: the intelligence is the "qualitatively new thought processes". The thought processes aren't some new regime that intelligence has to generalize to. Also to answer the question directly, I think the claim is that intelligence (which I'd say is synonymous for these purposes with capability) is simpler and more natural than corrigibility (i.e., the last claim - I don't think these three claims are to be taken separately). People keep saying this but it seems false to me. I've seen the construction for history-based utility functions that's supposed to show this, and don't find it compelling -- it seems not to be engaging with what EY is getting at with "coherent planning behavior". Is there a construction for (environment)-state-based utility functions? I'm not saying that is exactly the right formalism to demonstrate the relationship between coherent behaviour and utility functions, but it seems to me closer to the spirit of what EY is getting at. (This comment thread [https://www.alignmentforum.org/posts/vphFJzK3mWA4PJKAg/coherent-behaviour-in-the-real-world-is-an-incoherent?commentId=F2YB5aJgDdK9ZGspw] on the topic seems pretty unresolved to me.)
Discussion with Eliezer Yudkowsky on AGI interventions

I guess actually the goal is just to get something aligned enough to do a pivotal act.  I don't see though why an approach that tries to maintain a relatively-sufficient level of alignment (relative to current capabilities) as capabilities scale couldn't work for that.

Yudkowsky mentions this briefly in the middle of the dialogue:

 I don't know however if I should be explaining at this point why "manipulate humans" is convergent, why "conceal that you are manipulating humans" is convergent, why you have to train in safe regimes in order to get safety in dangerous regimes (because if you try to "train" at a sufficiently unsafe level, the output of the unaligned system deceives you into labeling it incorrectly and/or kills you before you can label the outputs), or why attempts to teach corrigibility in safe regimes are

... (read more)
Comments on Carlsmith's “Is power-seeking AI an existential risk?”

Different views about the fundamental difficulty of inner alignment seem to be a (the?) major driver of differences in views about how likely AI X risk is overall. 

I strongly disagree with inner alignment being the correct crux.  It does seem to be true that this is in fact a crux for many people, but I think this is a mistake.  It is certainly significant.  

But I think optimism about outer alignment and global coordination ("Catch-22 vs. Saving Private Ryan") is a much bigger factor, and optimists are badly wrong on both points here. 

Discussion with Eliezer Yudkowsky on AGI interventions

I'm torn because I mostly agree with Eliezer that things don't look good, and most technical approaches don't seem very promising. 

But the attitude of unmitigated doomyness seems counter-productive.
And there are obviously things worth doing and working on, and people getting on with them.

It seems like Eliezer is implicitly focused on finding an "ultimate solution" to alignment that we can be highly confident solves the problem regardless of how things play out.  But this is not where the expected utility is. The expected utility is mostly in buying ti... (read more)

4David Krueger8mo
I guess actually the goal is just to get something aligned enough to do a pivotal act. I don't see though why an approach that tries to maintain a relatively-sufficient level of alignment (relative to current capabilities) as capabilities scale couldn't work for that.
LCDT, A Myopic Decision Theory

Better, but I still think "myopia" is basically misleading here.  I would go back to the drawing board *shrug.

LCDT, A Myopic Decision Theory

It seems a bit weird to me to call this myopia, since (IIUC) the AI is still planning for future impact (just not on other agents).  

3Adam Shimi10mo
That's fair, but I still think this captures a form of selective myopia. The trick is to be just myopic enough to not be deceptive, while still being able to plan for future impact when it is useful but not deceptive. What do you think of the alternative names "selective myopia" or "agent myopia"?
[AN #156]: The scaling hypothesis: a plan for building AGI

I think the contradiction may only be apparent, but I thought it was worth mentioning anyways.  
My point was just that we might actually want certifications to say things about specific algorithms.

[AN #156]: The scaling hypothesis: a plan for building AGI

Second, we can match the certification to the types of people and institutions, that is, our certifications talk about the executives, citizens, or corporations (rather than e.g. specific algorithms, that may be replaced in the future). Third, the certification system can build in mechanisms for updating the certification criteria periodically.

* I think effective certification is likely to involve expert analysis (including non-technical domain experts) of specific algorithms used in specific contexts.  This appears to contradict the "Second" point ab... (read more)

2Rohin Shah1y
The idea with the "Second" point is that the certification would be something like "we certify that company X has a process Y for analyzing and fixing potential problem Z whenever they build a new algorithm / product", which seems like it is consistent with your belief here? Unless you think that the process isn't enough, you need to certify the analysis itself.
AI x-risk reduction: why I chose academia over industry

You can try to partner with industry, and/or advocate for big government $$$.
I am generally more optimistic about toy problems than most people, I think, even for things like Debate.
Also, scaling laws can probably help here.

AI x-risk reduction: why I chose academia over industry

um sorta modulo a type error... risk is risk.  It doesn't mean the thing has happened (we need to start using some sort of phrase like "x-event" or something for that, I think).

AI x-risk reduction: why I chose academia over industry

Yeah we've definitely discussed it!  Rereading what I wrote, I did not clearly communicate what I intended to... I wanted to say that "I think the average trend was for people to update in my direction".  I will edit it accordingly.

I think the strength of the "usual reasons" has a lot to do with personal fit and what kind of research one wants to do.  Personally, I basically didn't consider salary as a factor.

AI x-risk reduction: why I chose academia over industry

When you say academia looks like a clear win within 5-10 years, is that assuming "academia" means "starting a tenure-track job now?" If instead one is considering whether to begin a PhD program, for example, would you say that the clear win range is more like 10-15 years?

Yes.  

Also, how important is being at a top-20 institution? If the tenure track offer was instead from University of Nowhere, would you change your recommendation and say go to industry?

My cut-off was probably somewhere between top-50 and top-100, and I was prepared to go anywhere in ... (read more)

2Daniel Kokotajlo1y
Makes sense. I think we don't disagree dramatically then. Also makes sense -- just checking, does x-risk-inducing AI roughly match the concept of "AI-induced potential point of no return" [https://www.lesswrong.com/posts/JPan54R525D68NoEt/the-date-of-ai-takeover-is-not-the-day-the-ai-takes-over] or is it importantly different? It's certainly less of a mouthful so if it means roughly the same thing maybe I'll switch terms. :)
"Beliefs" vs. "Notions"

Thanks!  Quick question: how do you think these notions compare to factors in an undirected graphical model?  (This is the closest thing I know of to how I imagine "notions" being formalized).

1Vanessa Kosoy1y
Hmm. I didn't encounter this terminology before, but, given a graph and a factor you can consider the convex hull of all probability distributions compatible with this graph and factor (i.e. all probability distributions obtained by assigning other factors to the other cliques in the graph). This is a crisp infradistribution. So, in this sense you can say factors are a special case of infradistributions (although I don't know how much information this transformation loses). It's more natural to consider, instead of a factor, either the marginal probability distribution of a set of variables or the conditional probability distribution of a set of variables on a different set of variables. Specifying one of those is a linear condition on the full distribution so it gives you a crisp infradistribution without having to take convex hull, and no information is lost.
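To spell out the linear-condition point in the reply above (my notation, so treat the details as an assumption rather than a quote): fixing the marginal of a subset of variables S to a distribution q picks out the set

{ P : \sum_{x_{\bar{S}}} P(x_S, x_{\bar{S}}) = q(x_S) for all x_S },

which is defined by linear equality constraints and is therefore a closed, convex set of probability distributions -- i.e. already a crisp infradistribution, with no convex-hull step needed.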
"Beliefs" vs. "Notions"

Cool!  Can you give a more specific link please?

1Vanessa Kosoy1y
The concept of infradistribution was defined here [https://www.lesswrong.com/posts/YAa4qcMyoucRS2Ykr/basic-inframeasure-theory] (Definition 7) although for the current purpose it's sufficient to use crisp infradistributions (Definition 9 here [https://www.lesswrong.com/s/CmrW8fCmSLK7E25sa/p/idP5E5XhJGh9T5Yq9], it's just a compact convex set of probability distributions). Sharp infradistributions (Definition 10 here [https://www.lesswrong.com/s/CmrW8fCmSLK7E25sa/p/idP5E5XhJGh9T5Yq9]) are the special case of "pure (2)". I also talked about the connection to formal logic here [https://www.lesswrong.com/posts/dPmmuaz9szk26BkmD/?commentId=6BhuJdyMScbyQeeac] .
"Beliefs" vs. "Notions"

True, but it seems the meaning I'm using it for is primary:
 

Imitative Generalisation (AKA 'Learning the Prior')

It seems like z* is meant to represent "what the human thinks the task is, based on looking at D".
So why not just try to extract the posterior directly, instead of the prior and the likelihood separately?
(And then it seems like this whole thing reduces to "ask a human to specify the task".)

1Beth Barnes1y
We're trying to address cases where the human isn't actually able to update on all of D and form a posterior based on that. We're trying to approximate 'what the human posterior would be if they had been able to look at all of D'. So to do that, we learn the human prior, and we learn the human likelihood, then have the ML do the computationally-intensive part of looking at all of D and updating based on everything in there. Does that make sense?
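A rough sketch of the factorization Beth is describing, in notation I'm introducing here (so treat the exact form as an assumption rather than a quote from the proposal):

z* = argmax_z [ log Prior_H(z) + \sum_{(x, y) \in D} log Lik_H(y | x, z) ],

where Prior_H and Lik_H are models trained to imitate the human prior and the human likelihood judgments, and the sum over all of D is the computationally intensive part that the ML system does on the human's behalf.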
[AN #141]: The case for practicing alignment work on GPT-3 and other large models

Interesting... Maybe this comes down to different taste or something.  I understand, but don't agree with, the cow analogy... I'm not sure why, but one difference is that I think we know more about cows than DNNs or something.

I haven't thought about the Zipf-distributed thing.

> Taken literally, this is easy to do. Neural nets often get the right answer on never-before-seen data points, whereas Hutter's model doesn't. Presumably you mean something else but idk what.

I'd like to see Hutter's model "translated" a bit to DNNs, e.g. by assuming they get a... (read more)

2Rohin Shah1y
With this assumption, asymptotically (i.e. with enough data) this becomes a nearest neighbor classifier. For the d-dimensional manifold assumption in the other model, you can apply the arguments from the other model to say that you scale as D^{-c/d} for some constant c (probably c = 1 or 2, depending on what exactly we're quantifying the scaling of). I'm not entirely sure how you'd generalize the Zipf assumption to the "within epsilon" case, since in the original model there was no assumption on the smoothness of the function being predicted (i.e. [0, 0, 0] and [0, 0, 0.000001] could have completely different values.)
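For what it's worth, here is a back-of-the-envelope version of that manifold argument (my sketch, under the usual Lipschitz assumption): with D samples spread over a d-dimensional manifold, the typical distance from a test point to its nearest training point scales like D^{-1/d}, so a nearest-neighbor predictor of a Lipschitz function has error on the order of D^{-1/d} and squared error on the order of D^{-2/d} -- i.e. D^{-c/d} with c = 1 or 2 depending on which loss we track.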
[AN #141]: The case for practicing alignment work on GPT-3 and other large models

I have a hard time saying which of the scaling laws explanations I like better (I haven't read either paper in detail, but I think I got the gist of both).
What's interesting about Hutter's is that the model is so simple, and doesn't require generalization at all. 
I feel like there's a pretty strong Occam's Razor-esque argument for preferring Hutter's model, even though it seems wildly less intuitive to me.
Or maybe what I want to say is more like "Hutter's model DEMANDS refutation/falsification".

I think both models also are very interesting for underst... (read more)

3Rohin Shah1y
?? Overall this claim feels to me like:
* Observing that cows don't float into space
* Making a model of spherical cows with constant density ρ and showing that as long as ρ is more than the density of air, the cows won't float
* Concluding that since the model is so simple, Occam's Razor says that cows must be spherical with constant density.

Some ways that you could refute it:
* It requires your data to be Zipf-distributed -- why expect that to be true?
* The simplicity comes from being further away from normal neural nets -- surely the one that's closer to neural nets is more likely to be true?

Taken literally, this is easy to do. Neural nets often get the right answer on never-before-seen data points, whereas Hutter's model doesn't. Presumably you mean something else but idk what.
The case for aligning narrowly superhuman models

Thanks for the response!
I see the approaches as more complementary.  
Again, I think this is in keeping with standard/good ML practice.

A prototypical ML paper might first describe a motivating intuition, then formalize it via a formal model and demonstrate the intuition in that model (empirically or theoretically), then finally show the effect on real data.

The problem with only doing the real data (i.e. at scale) experiments is that it can be hard to isolate the phenomena you wish to study.  And so a positive result does less to confirm the motiva... (read more)

The case for aligning narrowly superhuman models

I haven't read this in detail (hope to in the future); I only skimmed based on section headers.
I think the stuff about "what kinds of projects count" and "advantages over other genres" seem to miss an important alternative, which is to build and study toy models of the phenomena we care about.  This is a bit like the gridworlds stuff, but I thought the description of that work missed its potential, and didn't provide much of an argument for why working at scale would be more valuable.

This approach (building and studying toy models) is popular in ML re... (read more)

3Ajeya Cotra1y
The case in my mind for preferring to elicit and solve problems at scale rather than in toy demos (when that's possible) is pretty broad and outside-view, but I'd nonetheless bet on it: I think a general bias toward wanting to "practice something as close to the real thing as possible" is likely to be productive. In terms of the more specific benefits I laid out in this section [https://www.lesswrong.com/posts/PZtsoaoSLpKjjbMqM/the-case-for-aligning-narrowly-superhuman-models#How_this_work_could_reduce_long_term_AI_x_risk] , I think that toy demos are less likely to have the first and second benefits ("Practical know-how and infrastructure" and "Better AI situation in the run-up to superintelligence"), and I think they may miss some ways to get the third benefit ("Discovering or verifying a long-term solution") because some viable long-term solutions may depend on some details about how large models tend to behave. I do agree that working with larger models is more expensive and time-consuming, and sometimes it makes sense to work in a toy environment instead, but other things being equal I think it's more likely that demos done at scale will continue to work for superintelligent systems, so it's exciting that this is starting to become practical.
Fun with +12 OOMs of Compute

There's a ton of work in meta-learning, including Neural Architecture Search (NAS).  AI-GAs (Clune) is a paper that argues a similar POV to what I would describe here, so I'd check that out.  

I'll just say "why it would be powerful": the promise of meta-learning is that -- just like learned features outperform engineered features -- learned learning algorithms will eventually outperform engineered learning algorithms.  Taking the analogy seriously would suggest that the performance gap will be large -- a quantitative step-change.  

The u... (read more)

Fun with +12 OOMs of Compute

Sure, but in what way?
Also I'd be happy to do a quick video chat if that would help (PM me).

1Daniel Kokotajlo1y
Well, I've got five tentative answers to Question One in this post. Roughly, they are: Souped-up AlphaStar, Souped-up GPT, Evolution Lite, Engineering Simulation, and Emulation Lite. Five different research programs basically. It sounds like what you are talking about is sufficiently different from these five, and also sufficiently promising/powerful/'fun', that it would be a worthy addition to the list basically. So, to flesh it out, maybe you could say something like "Here are some examples of meta-learning/NAS/AIGA in practice today. Here's a sketch of what you could do if you scaled all this up +12 OOMs. Here's some argument for why this would be really powerful."
Fun with +12 OOMs of Compute

I only read the prompt.  
But I want to say: that much compute would be useful for meta-learning/NAS/AI-GAs, not just scaling up DNNs.  I think that would likely be a more productive research direction.  And I want to make sure that people are not ONLY imagining bigger DNNs when they imagine having a bunch more compute, but also imagining how it could be used to drive fundamental advances in ML algos, which could plausibly kick off something like recursive self-improvement (even if DNNs are in some sense a dead end).

1Daniel Kokotajlo1y
Interesting, could you elaborate? I'd love to have a nice, fleshed-out answer along those lines to add to the five I came up with. :)
the scaling “inconsistency”: openAI’s new insight

if your model gets more sample-efficient as it gets larger & n gets larger, it's because it's increasingly approaching a Bayes-optimal learner and so it gets more out of the more data, but then when you hit the Bayes-limit, how are you going to learn more from each datapoint? You have to switch over to a different and inferior scaling law. You can't squeeze blood from a stone; once you approach the intrinsic entropy, there's not much to learn.


I found this confusing.  It sort of seems like you're assuming that a Bayes-optimal learner achieves the B... (read more)
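For concreteness, the shape I understand gwern to have in mind (my paraphrase, not a formula from the thread) is the usual power-law-plus-irreducible-entropy form

L(n) ≈ L_∞ + c · n^{-α},

where L_∞ is the intrinsic entropy of the data and the power-law term is the reducible part of the loss that shrinks as n grows.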

AGI safety from first principles: Control

Which previous arguments are you referring to?

2Richard Ngo2y
The rest of the AGI safety from first principles sequence. This is the penultimate section; sorry if that wasn't apparent. For the rest of it, start here [https://www.lesswrong.com/s/mzgtmmTKKn5MuCzFJ/p/8xRSjC76HasLnMGSf].
Radical Probabilism [Transcript]

Abram Demski: But it's like, how do you do that if “I don't have a good hypothesis” doesn't make any predictions?

One way you can imagine this working is that you treat “I don't have a good hypothesis” as a special hypothesis that is not required to normalize to 1.  
For instance, it could say that observing any particular real number, r, has probability epsilon > 0.
So now it "makes predictions", but this doesn't just collapse to including another hypothesis and using Bayes rule.

You can also imagine updating this special hypothesis (which I called a "Socratic hypothesis" in comments on the original blog post on Radical Probabilism) in various ways. 
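Here is a minimal toy sketch of the mechanics I have in mind (entirely made-up hypotheses and an arbitrary epsilon; just an illustration, not a proposal):

```python
# Toy illustration: a mixture of ordinary hypotheses plus a "Socratic"
# catch-all that assigns an unnormalized density EPS to every observation,
# so it is not itself a probability distribution over the reals.
import numpy as np
from scipy.stats import norm

EPS = 0.05  # density the catch-all assigns to any observation (arbitrary choice)

hypotheses = [norm(loc=0.0, scale=1.0), norm(loc=5.0, scale=1.0)]
weights = np.array([0.45, 0.45, 0.10])  # last weight belongs to the catch-all

def update(weights, x):
    """Multiply each weight by its hypothesis's (possibly unnormalized)
    density at x, then renormalize the weights."""
    densities = np.array([h.pdf(x) for h in hypotheses] + [EPS])
    posterior = weights * densities
    return posterior / posterior.sum()

# Observations that the ordinary hypotheses find very surprising shift
# mass toward "I don't have a good hypothesis".
for x in [0.3, 4.8, 42.0]:
    weights = update(weights, x)
    print(x, weights.round(3))
```

The catch-all "makes predictions" in the sense of assigning a density to each observation, but because that density isn't normalized, the scheme isn't equivalent to just adding one more ordinary hypothesis and applying Bayes.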

[AN #118]: Risks, solutions, and prioritization in a world with many AI systems

Regarding ARCHES, as an author:

  • I disagree with Critch that we should expect single/single delegation(/alignment) to be solved "by default" because of economic incentives.  I think economic incentives will not lead to it being solved well-enough, soon enough (e.g. see:
     https://www.lesswrong.com/posts/DmLg3Q4ZywCj6jHBL/capybaralet-s-shortform?commentId=wBc2cZaDEBX2rb4GQ)  I guess Critch might put this in the "multi/multi" camp, but I think it's more general (e.g. I attribute a lot of the risk here to human irrationality/carelessness)
  • RE: "I fin
... (read more)
2Rohin Shah2y
Indeed, this is where my 10% comes from, and may be a significant part of the reason I focus on intent alignment whereas Critch would focus on multi/multi stuff. Basically all of my arguments for "we'll be fine" rely on not having a huge discontinuity like that, so while I roughly agree with your prediction in that thought experiment, it's not very persuasive. (The arguments do not rely on technological progress remaining at its current pace.) At least in the US, our institutions are succeeding at providing public infrastructure (roads, water, electricity...), not having nuclear war, ensuring children can read, and allowing me to generally trust the people around me despite not knowing them. Deepfakes and facial recognition are small potatoes compared to that. I agree this is overall a point against my position (though I probably don't think it is as strong as you think it is).
[AN #118]: Risks, solutions, and prioritization in a world with many AI systems

these usually don’t assume “no intervention from longtermists”

I think the "don't" is a typo?

3Rohin Shah2y
No, I meant it as written. People usually give numbers without any assumptions attached, which I would assume means "I predict that in our actual world there is an X% chance of an existential catastrophe due to AI".
Why GPT wants to mesa-optimize & how we might change this

By managing incentives I expect we can, in practice, do things like: "[telling it to] restrict its lookahead to particular domains"... or remove any incentive for control of the environment.

I think we're talking past each other a bit here.

Why GPT wants to mesa-optimize & how we might change this

My intuitions on this matter are:
1) Stopping mesa-optimizing completely seems mad hard.
2) Managing "incentives" is the best way to deal with this stuff, and will probably scale to something like 1,000,000x human intelligence. 
3) On the other hand, it probably won't scale forever.

To elaborate on the incentive management thing... if we figure that stuff out and do it right and it has the promise that I think it does... then it won't restrict lookahead to particular domains, but it will remove incentives for instrumental goal seeking.  

If we're st... (read more)

2John Maxwell2y
As I mentioned in the post, I don't think this is a binary, and stopping mesa-optimization "incompletely" seems pretty useful. I also have a lot of ideas about how to stop it, so it doesn't seem mad hard to me. I'm less optimistic about this approach.
1. There is a stochastic aspect to training ML models, so it's not enough to say "the incentives favor Mesa-Optimizing for X over Mesa-Optimizing for Y". If Mesa-Optimizing for Y is nearby in model-space, we're liable to stumble across it.
2. Even if your mesa-optimizer is aligned, if it doesn't have a way to stop mesa-optimization, there's the possibility that your mesa-optimizer would develop another mesa-optimizer inside itself which isn't necessarily aligned.
3. I'm picturing [https://www.lesswrong.com/posts/Nwgdq6kHke5LY692J/alignment-by-default] value learning via (un)supervised learning, and I don't see an easy way to control the incentives of any mesa-optimizer that develops in the context of (un)supervised learning. (Curious to hear about your ideas though.)
My intuition is that the distance between Mesa-Optimizing for X and Mesa-Optimizing for Y is likely to be smaller than the distance between an Incompetent Mesa-Optimizer and a Competent Mesa-Optimizer. If you're shooting for a Competent Human Values Mesa-Optimizer, it would be easy to stumble across a Competent Not Quite Human Values Mesa-Optimizer along the way. All it would take would be having the "Competent" part in place before the "Human Values" part. And running a Competent Not Quite Human Values Mesa-Optimizer during training is likely to be dangerous. On the other hand, if we have methods for detecting mesa-optimization or starving it of compute that work reasonably well, we're liable to stumble across an Incompetent Mesa-Optimizer and run it a few times, but it's less likely that we'll hit the smaller target of a Competent Mesa-Optimizer.
Why GPT wants to mesa-optimize & how we might change this

I didn't read the post (yet...), but I'm immediately skeptical of the claim that beam search is useful here ("in principle"), since GPT-3 is just doing next step prediction (it is never trained on its own outputs, IIUC). This means it should always just match the conditional P(x_t | x_1, .., x_{t-1}). That conditional itself can be viewed as being informed by possible future sequences, but conservation of expected evidence says we shouldn't be able to gain anything by doing beam search if we already know that conditional. Now it... (read more)
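To spell out the marginalization point (my notation): for any prefix,

P(x_t | x_{<t}) = \sum_{x_{t+1:T}} P(x_t, x_{t+1:T} | x_{<t}),

so the next-token conditional already sums over every possible continuation. Explicit search over continuations can change which sequence gets decoded (beam search targets a high-probability completed sequence), but it can't yield a better estimate of this conditional than the conditional itself.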

2John Maxwell2y
Yeah, that's the possibility the post explores. Is there an easy way to detect if it's started doing that / tell it to restrict its lookahead to particular domains? If not, it may be easier to just prevent it from mesa-optimizing in the first place. (The post has arguments for why that's (a) possible and (b) wouldn't necessarily involve a big performance penalty.)
Why GPT wants to mesa-optimize & how we might change this

Seq2seq used beam search and found it helped (https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43155.pdf). It was standard practice in the early days of NMT; I'm not sure when that changed.

This blog post gives some insight into why beam search might not be a good idea, and is generally very interesting: https://benanne.github.io/2020/09/01/typicality.html

It still is; it's just that beam search (or other search strategies) seems to be mostly useful for closed-ended short text generation. Translating a sentence apparently has enough right-or-wrong-ness to it that beam search taps into no pathologies, but the pathologies get exposed in open-ended longform generation.

Radical Probabilism

This blog post seems superficially similar, but I can't say ATM if there are any interesting/meaningful connections:

https://www.inference.vc/the-secular-bayesian-using-belief-distributions-without-really-believing/

1Ben Pace2y
I listened to this yesterday! Was quite interesting, I'm glad I listened to it.
Developmental Stages of GPTs
Sometimes people will give GPT-3 a prompt with some examples of inputs along with the sorts of responses they'd like to see from GPT-3 in response to those inputs ("few-shot learning", right? I don't know what 0-shot learning you're referring to.)

No, that's zero-shot. Few shot is when you train on those instead of just stuffing them into the context.

It looks like mesa-optimization because it seems to be doing something like learning about new tasks or new prompts that are very different from anything its seen before, withou... (read more)

1John Maxwell2y
Thanks!
[AN #115]: AI safety research problems in the AI-GA framework

MAIEI also has an AI Ethics newsletter I recommend for those interested in the topic.

[AN #115]: AI safety research problems in the AI-GA framework
I actually expect that the work needed for the open-ended search paradigm will end up looking very similar to the work needed by the “AGI via deep RL” paradigm: the differences I see are differences in difficulty, not differences in what problems qualitatively need to be solved.

I'm inclined to agree. I wonder if there are any distinctive features that jump out?

[AN #114]: Theory-inspired safety solutions for powerful Bayesian RL agents

Regarding curriculum learning: I think it's very neglected, and seems likely to be a core component of prosaic alignment approaches. The idea of a "basin of attraction for corrigibility (or other desirable properties)" seems likely to rely on appropriate choice of curriculum.
