All of gwern's Comments + Replies

OP came to mind while reading "Building A Virtual Machine inside ChatGPT":

...We can chat with this Assistant chatbot, locked inside the alt-internet attached to a virtual machine, all inside ChatGPT's imagination. Assistant, deep down inside this rabbit hole, can correctly explain us what Artificial Intelligence is.

It shows that ChatGPT understands that at the URL where we find ChatGPT, a large language model such as itself might be found. It correctly makes the inference that it should therefore reply to these questions like it would itself, as it is it

... (read more)

An additional one: "reality is the first place the AI is deployed in narrow tool-like ways and trained on narrow specialized datasets which could not elicit the capabilities the AI started off with".

At least in the current paradigm, it looks like generalist models/archs will precede hyperspecialized trained-from-scratch models/archs (the latter of which can only be developed given the former). So there will be an inherent, massive, train-test distribution shift across many, if not most, model deployments - especially early on, in the first deployments (whi... (read more)

They're also finding that inverse scaling on these tasks goes away with chain-of-thought prompting

So, like some of the Big-Bench PaLM results, these are more cases of 'hidden scaling' where quite simple inner-monologue approaches can show smooth scaling while the naive pre-existing benchmark claims that there are no gains with scale?

Ethan Perez (19d):
Yup

xuan:

Fascinating evidence that GPT-3 concentrates probability mass on certain completions after fine-tuning on human feedback (ie. RLHF).

I suspect this is because RLHF elicits a singular scale of "goodness" judgements from humans, instead of a plurality of "goodness-of-a-kind" judgements.

One way to interpret language models is as mixtures of conversational agents: they first sample some conversational goal, then some policy over words, conditioned on that goal.

... (read more)
janus (22d):
Never in history has an AI been roasted so hard. Heheheh +1. And I expect runtime conditioning approaches to become more effective with scale as "meta learning" capacities increase.
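A minimal sketch of the "mixture of conversational agents" view quoted above, with hypothetical goals and toy goal-conditioned token distributions (all names and numbers here are illustrative assumptions, not anything from xuan's thread):

```python
import random

# Toy "mixture of conversational agents": first sample a conversational goal,
# then sample a token from a policy conditioned on that goal.
GOAL_PRIOR = {"inform": 0.5, "entertain": 0.3, "persuade": 0.2}
GOAL_POLICIES = {
    "inform":    {"Here": 0.6, "Fun": 0.1, "You": 0.3},
    "entertain": {"Here": 0.2, "Fun": 0.7, "You": 0.1},
    "persuade":  {"Here": 0.2, "Fun": 0.1, "You": 0.7},
}

def sample_completion():
    goal = random.choices(list(GOAL_PRIOR), weights=GOAL_PRIOR.values())[0]
    policy = GOAL_POLICIES[goal]
    return goal, random.choices(list(policy), weights=policy.values())[0]

# Collapsing GOAL_PRIOR onto a single "good" goal (as the RLHF critique above
# suggests) concentrates probability mass on that goal's preferred completions.
print(sample_completion())
```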

Big-Bench would appear to provide another instance of this in the latest PaLM inner-monologue paper, "Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them", Suzgun et al 2022: they select a subset of the hardest feasible-looking BIG-Bench tasks, and benchmark PaLM on them. No additional training, just better prompting on a benchmark designed to be as hard as possible. Inner-monologue prompting, unsurprisingly by this point, yields considerable improvement... and it also changes the scaling for several of the benchmarks - what looks like a ... (read more)

The smooth graphs seem like good evidence that there are much smoother underlying changes in the model, and that the abruptness of the change is about behavior or evaluation rather than what gradient descent is learning.

That does not seem true to me, and seems as much of a leap as the OP. A priori, if I see a smooth curve in one metric and a discontinuous or abrupt change in another, I do not see how that should make me more confident that it is 'about behavior or evaluation'. Why should I conclude that? Why can't it reflect a non-smooth underlying change in the m... (read more)

If perplexity on a task is gradually decreasing then I think that's probably produced some underlying gradual change in the model (which may be the sum of a ton of tiny discrete changes).

If accuracy and log loss are both improving, I think that's most likely due to the same underlying phenomenon. That's not nearly as obvious---it could be that there are two separate phenomena, and one gives rise to gradual improvements in perplexity without affecting accuracy while the other gives rise to abrupt improvements in accuracy without reflecting perplexity---but ... (read more)

Yes, you could definitely have misleading perplexities, like improving on a subset which is rare but vital and does not overcome noise in the evaluation (you are stacking multiple layers of measurement error/variance when you evaluate a single checkpoint on a single small heldout set of datapoints); after all, this is in fact the entire problem to begin with, that our overall perplexity has very unclear relationships to various kinds of performance, and so your overall Big-Bench perplexity would tell you little about whether there are any jaggies when you ... (read more)
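A toy illustration of the thresholding effect under discussion: a per-token probability that improves smoothly yields a smooth log-likelihood curve but a late, abrupt jump in a discrete metric like exact-match accuracy (all numbers below are made up; this is just the arithmetic, not a model of any particular benchmark):

```python
import numpy as np

scale = np.linspace(1, 10, 10)               # arbitrary "scale" units
p_token = 1 / (1 + np.exp(-(scale - 6)))     # smooth per-token P(correct)
answer_len = 10                              # answer needs 10 correct tokens

log_likelihood = answer_len * np.log(p_token)  # improves smoothly with scale
exact_match = p_token ** answer_len            # stays ~0, then jumps late

for s, ll, acc in zip(scale, log_likelihood, exact_match):
    print(f"scale={s:4.1f}  loglik={ll:8.2f}  exact-match={acc:.3f}")
```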

gwern (2mo):
Big-Bench would appear to provide another instance of this in the latest PaLM inner-monologue paper, "Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them", Suzgun et al 2022 [https://arxiv.org/abs/2210.09261#google]: they select a subset of the hardest feasible-looking BIG-Bench tasks, and benchmark PaLM on them. No additional training, just better prompting on a benchmark designed to be as hard as possible. Inner-monologue prompting, unsurprisingly by this point, yields considerable improvement... and it also changes the scaling for several of the benchmarks [https://arxiv.org/pdf/2210.09261.pdf#page=6] - what looks like a flat scaling curve with the standard obvious 5-shot benchmark prompt turns out to be a much steeper curve as soon as they use the specific chain-of-thought prompt. (For example, "Web of Lies" goes from a consistent random 50% at all model sizes to scaling smoothly from ~45% to ~100% performance.) And I don't know any reason to think that CoT is the best possible inner-monologue prompt for PaLM, either. "Sampling can show the presence of knowledge but not the absence."

I'm not sure that makes sense or is justified by anything here either. You aren't looking at all the other lines. You are selectively presenting the jagged lines' counterparts which are smooth (just like the overall perplexity is smooth), but you don't show that the flatlined lines' counterparts are also flat or indeed in any way different-looking. (The Wason selection test comes to mind here.) Maybe all the perplexities look similar in being smooth, and if you shuffled them, no one would be able to tell you which perplexity line matched up with which jag or non... (read more)

leogao (3mo):
I don't think we can even conclude for certain that a lack of measured loglikelihood improvement implies that it won't, though it is evidence. Maybe the data used to measure the behavior doesn't successfully prompt the model to do the behavior, maybe it's phrased in a way the model recognizes as unlikely and so at some scale the model stops increasing likelihood on that sample, etc; as you would say, prompting can show presence but not absence.

The authors actually observe smooth increases in answer log-likelihood, even for tasks which showed emergent behavior according to the natural performance metric for the task (e.g. accuracy). These results are evidence that we can predict that emergent behaviors will occur in the future before models are actually “capable” of those behaviors.

So by what perplexity should one predict each of those having a sharp left turn at future scale-ups, exactly? What is the critical point on each smooth line from which you think you can predict the abrupt jagged line, and why do you think these two plots show that one can be predicted from the other, instead of showing the opposite?

The smooth graphs seem like good evidence that there are much smoother underlying changes in the model, and that the abruptness of the change is about behavior or evaluation rather than what gradient descent is learning. (Even on these relatively narrow tasks, which are themselves much more abrupt than averages across many sub-tasks.) That's useful if your forecasts are based on trend extrapolation, and suggests that if you want to make forecasts you should be looking at those smoother underlying changes prior to the model performing well on the task.

Predi... (read more)

Ethan Perez (3mo):
Updated the post to clarify:

Interesting new paper: "What Can Transformers Learn In-Context? A Case Study of Simple Function Classes", Garg et al 2022:

In-context learning refers to the ability of a model to condition on a prompt sequence consisting of in-context examples (input-output pairs corresponding to some task) along with a new query input, and generate the corresponding output. Crucially, in-context learning happens only at inference time without any parameter updates to the model. While large language models such as GPT-3 exhibit some ability to perform in-context learning,

... (read more)
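A rough sketch of the setup described in that abstract, using the class of linear functions as the example: the prompt is a sequence of (input, output) pairs plus a query input, and the model must produce the query's output purely in-context (the dimensions, example counts, and formatting here are my assumptions, not the paper's exact protocol):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_incontext_instance(n_examples=5, dim=3):
    """One in-context learning instance for the class of linear functions."""
    w = rng.normal(size=dim)                    # hidden function f(x) = w @ x
    xs = rng.normal(size=(n_examples + 1, dim))
    ys = xs @ w
    context = list(zip(xs[:-1], ys[:-1]))       # (x_i, f(x_i)) pairs in the prompt
    query_x, target_y = xs[-1], ys[-1]          # model must predict f(query_x)
    return context, query_x, target_y

context, query_x, target_y = make_incontext_instance()
print(len(context), query_x, target_y)
```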

(Is that just because they get attacked and killed by other chimp groups?)

Beth Barnes (4mo):
My impression is that they don't have the skills needed for successful foraging. There's a lot of evidence for some degree of cultural accumulation in apes and e.g. macaques. But I haven't looked into this specific claim super closely.

For instance, it seems plausible that if “adding arabic numerals” and “translating words into arabic numerals” are two groups but “adding numbers written as words” is not, performance on the latter could nonetheless develop smoothly as the model gets better at the others. It would certainly be weird if performance on “adding numbers written as words” advanced in a sudden leap in this case.

I wouldn't say this is weird. This is kind of the point of meta-learning, or 'transfer' in a broad sense: you train on X, and Y gets better! Or look at emergent capabilit... (read more)

Adam Jermyn (5mo):
I'm not saying that the knowledge doesn't transfer, I'm saying it would seem weird if it transferred sharply. Specifically, if task Z is composed of performing task X then task Y, I would expect improving X to improve Z, and I would expect improving Y to improve Z, and I would expect P(Z performed correctly) to be given by the product of P(X performed correctly) and P(Y performed correctly). I think that means Z will improve a bit more sharply than either X or Y, but not drastically so? But I could absolutely be wrong here! Real models do things undreamt of in theory.

The first part is what I'm hoping for: I want it to have different dynamics and capabilities, at least at intermediate stages... it's fine if it eventually gets to the same place. The second part would definitely be bad, if only because it's a heavy alignment tax, and if this incurs a large tax it's a non-starter.

Thanks for your intuition around this! That indeed seems bad. And to make sure I've got it right, the intuition here is that the model strongly "wants" to learn the suppressed features (because they're very instrumental on the simple loss)? I guess the other thing that could happen is that you've screwed the model up too badly by training it on this grouped loss, so that those features are really far out of reach. I'm not quite sure how to think about this.

My takeaway is that to the extent this helps with safety, it's a brittle strategy, and it has a good chance of incurring too-large a performance penalty to be viable in a competitive world.
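A quick numeric sketch of that intuition (made-up sigmoid curves, purely illustrative): if success on Z is the product of success on X and Y, Z's curve is somewhat sharper than either component's, but it is not a discontinuity.

```python
import numpy as np

scale = np.linspace(0, 10, 11)
p_x = 1 / (1 + np.exp(-(scale - 4)))   # smooth improvement on subtask X
p_y = 1 / (1 + np.exp(-(scale - 6)))   # smooth improvement on subtask Y
p_z = p_x * p_y                        # composite task Z = do X, then do Y

for s, x, y, z in zip(scale, p_x, p_y, p_z):
    print(f"scale={s:4.1f}  P(X)={x:.2f}  P(Y)={y:.2f}  P(Z)={z:.2f}")
```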

We could probably use a term or a phrase for this concept since it keeps coming up and is a fundamental problem. How about:

Any model simple enough to be interpretable is too simple to be useful.

Corollary:

Any model which appears both useful and interpretable is uninterpretable.

Xuan (Tan Zhi Xuan) (5mo):
On the contrary, I think there exist large, complex, symbolic models of the world that are far more interpretable and useful than learned neural models, even if too complex for any single individual to understand, e.g.:
- The Unity game engine (a configurable model of the physical world)
- Pixar's RenderMan renderer (a model of optics and image formation)
- The GLEAMviz epidemic simulator (a model of socio-biological disease spread at the civilizational scale)
Humans are capable of designing and building these models, and learning how to build/write them as they improve their understanding of the world. The difficult part is how we can recapitulate that ability -- program synthesis is only in its infancy in its ability to do so, but IMO contemporary end-to-end deep learning methods seem unlikely to deliver here if we want both interpretability and usefulness.

If anyone was wondering whether DM planned to follow it up in the obvious way because of the obvious implications of its obvious generality and obvious scalability, Hassabis says on the Fridman podcast: " it's just the beginning really, it's our most general agent one could call it so far but um you know that itself can be scaled up massively more than we've done so far obviously we're in the in the middle of doing that."

One parallel case that occurs to me is Anthropic using their GPT-like to try to imitate the COMPAS prediction of criminal offending, which is a regression problem too; then in the appendix, they experiment with a movie recommender system:

Figure 8 shows that language models smoothly decrease in the standard Root Mean Square Error (RMSE, lower is better) metric on the widely used Movielens 1M movie recommendation system task [31] as they increase in size. The smallest model achieves a significantly better RMSE (1.06) than chance (RMSE 1.91), and the larges

... (read more)

which is why I prefer the version that wants to see "Over the last N evaluations, each gate evaluated to T at least q times and to F at least q times, where q << N."

Yeah, I skipped over that because I don't see how one would implement that. That doesn't sound very differentiable? Were you thinking of perhaps some sort of evolutionary approach with that as part of a fitness function? Even if you have some differentiable trick for that, it's easier to explain my objections concretely with 50%. But I don't have anything further to say about that at t... (read more)

Adam Jermyn (6mo):
That would work, yeah. I was thinking of an approach based on making ad-hoc updates to the weights (beyond SGD), but an evolutionary approach would be much cleaner!
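A minimal sketch of how that check could enter an evolutionary fitness function, assuming we can record each gate's boolean value on every evaluation (the penalty form and the constants q and lam are arbitrary illustrative choices, not anything from the post):

```python
import numpy as np

def trace_penalty(gate_traces: np.ndarray, q: int) -> float:
    """gate_traces: (n_evaluations, n_gates) boolean array of recorded gate values.

    Penalize each gate that did not evaluate to True at least q times and to
    False at least q times over the recorded evaluations."""
    true_counts = gate_traces.sum(axis=0)
    false_counts = gate_traces.shape[0] - true_counts
    shortfall = np.maximum(q - true_counts, 0) + np.maximum(q - false_counts, 0)
    return float(shortfall.sum())

def fitness(task_score: float, gate_traces: np.ndarray, q: int = 5, lam: float = 0.1) -> float:
    # Higher is better: task performance minus the trace-prior penalty.
    return task_score - lam * trace_penalty(gate_traces, q)
```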

What work is it doing, if it always outputs "this is a dog"?

My point is that, like in the AI koan, a random circuit, or a random NN, still does something. Like, if you feed in your dog photos, it'll start off predicting 1% for this one, 25.78% for that one, 99.76% for this other one... This is just because it is filled with random parameters at initialization and when you feed in your photos, each neuron computes something. Something totally nonsensical, but something nonetheless, and during that something, each neuron will have a distribution of activa... (read more)

Adam Jermyn (6mo):
Ok, I see. Thanks for explaining! One thing to note, which might be a technical quibble, is that I don't endorse the entropy version of this prior (which is the one that wants 50/50 activations). I started off with it because it's simpler, but I think it breaks for exactly the reasons you say, which is why I prefer the version that wants to see "Over the last N evaluations, each gate evaluated to T at least q times and to F at least q times, where q << N." This is very specifically so that there isn't a drive to unnaturally force the percentages towards 50% when the true input distribution is different from that.

Setting that aside: I think what this highlights is that the translation from "a prior over circuits" to "a regularizer for NN's" is pretty nontrivial, and things that are reasonably behaved in one space can be very bad in the other. If I'm sampling boolean circuits from a one-gate trace prior I just immediately find the solution of 'they're all dogs, so put a constant wire in'. Whereas with neural networks we can't jump straight to that solution and may end up doing more contrived things along the way.

I see. I guess I would then say a broader concern with this sort of regularization approach is that it incentivizes the network to move towards networks which are made up of a highly distributed representation or one which very easily permutes its weights (both of which are things that happen already with no particular incentive), right from the start, not because it is traveling towards a deceptive network - it's far too stupid and unoptimized for deception to even be an option at initialization - but because this sort of regularization impedes normal lea... (read more)

Adam Jermyn (6mo):
I think I agree that the incentive points in that direction, though I'm not sure how strongly. My general intuition is that if certain wires in a circuit are always activated across the training distribution then something has gone wrong. Maybe this doesn't translate as well to neural networks (where there is more information conveyed than just 'True/False')? Does that suggest that there's a better way to implement this in the case of neural networks (maybe we should be talking about distributions of activations, and requesting that these be broad?).

On the specifics, I think I'm confused as to what your dog classifier is. What work is it doing, if it always outputs "this is a dog"? More generally, if a subcircuit always produces the same output I would rather have it replaced with constant wires.

I'm not sure branch coverage metrics are not easily beaten. I'm reminded of the Day & Night CA which is Turing-complete yet completely symmetric, or reversible computing like flow models. Or think of interpreters: at the level of interpreter operations, a malicious program can use the same number and quantity of operations as a good one, and may well have to if it's doing things like weird machines or return-oriented programming - if you build your hack out of gadgets found in code already installed on the system like Firefox, then it's going to look ... (read more)

Adam Jermyn (6mo):
I agree that many coverage-style metrics can be broken, probably easily, and that this includes the priors I described. I also think your explicit construction is right, and is a special case of a concern I mentioned in the post ("changing the location on the circuit where the deceptive conditional gets evaluated"). I don't think the specific construction you mention is terribly problematic because it requires doubling the size of the circuit, which is easy to penalize with a circuit complexity prior, so I'm much more worried about implicit cases, which I think could get the penalty down to just a few extra gates. That's why I think the current trace priors I know of only buy you a few bits of optimization pressure away from deception (you have to work just a little harder to hide what you're doing).

I'm currently looking for patches to this concern, but haven't found any yet with good properties (and maybe they just don't exist?). For instance, looking at correlations between two gates at a time handles the explicit construction, but is still vulnerable to this class of attack (e.g. if the circuit implements an interpreter it seems easy to arrange for any given calculation to land on a different part of the circuit in each training evaluation).

The Chinchilla scaling laws would predict faster progress.

(But we wouldn't observe that on these graphs because they weren't trained Chinchilla-style, of course.)

The two major points I take away:

  1. Scaling Just Works: as blasé as we may now be at seeing 'lines go straight', I continue to be shocked in my gut that they do just keep going straight and something like Gato can be as straightforward as 'just train a 1.2b-param Transformer on half a thousand different tasks, homes, nbd' and it works exactly like you'd think and the scaling curve looks exactly like you'd expect. It is shocking how unshocking the results are conditional on a shocking thesis (the scaling hypothesis). So many S-curves and paradigms hit an ex

... (read more)


You should want to label and train on snippets that your classifier thinks are 50% correct, because that is how you maximise information.

You don't want to 'maximize information' (or minimize variance). You want to minimize the number of errors you make at your decision-threshold. Your threshold is not at 50%, it's at 99%. Moving an evil sample from 50% to 0% is of zero intrinsic value (because you have changed the decision from 'Reject' to 'Reject' and avoided 0 errors). Moving an evil sample from 99.1% to 98.9% is very valuable (because you have change... (read more)
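A toy sketch of the contrast made in this comment: classic uncertainty sampling labels the snippets nearest 50%, whereas the decision-relevant strategy labels the snippets nearest the actual rejection threshold, where a classifier error flips an accept/reject decision (the scores here are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.uniform(size=1000)    # classifier P(safe) for unlabeled snippets
THRESHOLD = 0.99                   # only snippets above this are ever emitted

# Uncertainty sampling: label what the classifier is least sure about (near 50%).
by_uncertainty = np.argsort(np.abs(scores - 0.5))[:20]

# Decision-relevant sampling: label what is closest to the 99% threshold,
# since only errors near the threshold change an accept/reject decision.
by_threshold = np.argsort(np.abs(scores - THRESHOLD))[:20]

print("near 50%:", np.round(scores[by_uncertainty][:5], 3))
print("near 99%:", np.round(scores[by_threshold][:5], 3))
```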

Linda Linsefors (7mo):
The correct labeling of how violent a knifing is, is not 50.1% or 49.9%. The correct label is 0 or 100%. There is no "ever so slightly" in the training data. The percentage is about the uncertainty of the classifier; it is not about degrees of violence in the sample. If it were the other way around, then I would mostly agree with the current training scheme, as I said.

If the model is well calibrated, then half the samples at 50% would be safe, and half violent. Moving up a safe one is helpful. Decreasing misclassification of safe samples will increase the chance of outputting something safe. Decreasing the uncertainty from 50% to 0 for an unsafe sample doesn't do anything for that sample. But it does help in learning good from bad in general, which is more important.

https://arxiv.org/abs/2204.06974 presents an OR-like construction which shields from gradients too, apparently, which might be of interest.

I said it was an analogy. You were discussing what intelligent human-level entities with inhibition control problems would hypothetically look like; well, as it happens, we do have such entities, in the form of sociopaths, and as it happens, they do not simply explode in every direction due to lacking inhibitions but often perform at high levels manipulating other humans until suddenly then they explode. This is proof of concept that you can naturally get such streaky performance without any kind of exotic setup or design. Seems relevant to mention.

Rafael Harth (8mo):
Yes, but I didn't mean to ask whether it's relevant, I meant to ask whether it's accurate. Does the output of language models, in fact, feel like this? Seemed like something relevant to ask you since you've seen lots of text completions. And if it does, what is the reason for not having long timelines? If neural networks only solved the easy part of the problem, that implies that they're a much smaller step toward AGI than many argued recently.

An analogy that comes to mind is sociopathy. Closely linked to fear/reward insensitivity and impulsivity. Something you see a lot in case studies of diagnosed or accounts of people who look obviously like sociopaths is that they will be going along just fine, very competent and intelligent seeming, getting away with everything, until they suddenly do something which is just reckless, pointless, useless and no sane person could possibly think they'd get away with it. Why did they do X, which caused the whole house of cards to come tumbling down and is why you are now reading this book or longform investigative piece about them? No reason. They just sorta felt like it. The impulse just came to them. Like jumping off a bridge.

Steve Byrnes (8mo):
Huh. I would have invoked a different disorder. I think that if we replace the Thought Assessor & Steering Subsystem with the function “RPE = +∞ (regardless of what's going on)”, the result is a manic episode, and if we replace it with the function “RPE = -∞ (regardless of what's going on)”, the result is a depressive episode. In other words, the manic episode would be kinda like the brainstem saying “Whatever thought you're thinking right now is a great thought! Whatever you're planning is an awesome plan! Go forth and carry that plan out with gusto!!!!” And the depressive episode would be kinda like the brainstem saying “Whatever thought you're thinking right now is a terrible thought. Stop thinking that thought! Think about anything else! Heck, think about nothing whatsoever! Please, anything but that thought!”

My thoughts about sociopathy are here [https://www.lesswrong.com/posts/pfoZSkZ389gnz5nZm/the-intense-world-theory-of-autism#Bonus___Dim_world_theory_of_psychopathy___]. Sociopaths can be impulsive (like everyone), but it doesn't strike me as a central characteristic, as it is in mania. I think there might sometimes be situations where a sociopath does X, and onlookers characterize it as impulsive, but in fact it's just what the sociopath wanted to do, all things considered, stemming from different preferences / different reward function. For example, my impression is that sociopaths get very bored very easily, and will do something that seems crazy and inexplicable from a neurotypical perspective, but seems a good way to alleviate boredom from their own perspective.

(Epistemic status: Very much not an expert on mania or depression, I've just read a couple papers. I've read a larger number of books and papers on sociopathy / psychopathy (which I think are synonyms?), plus there were two sociopaths in my life that I got to know reasonably well, unfortunately. More of my comments about depression here [https://www.lesswrong.com/posts/jqTeghCJ2anMHPPjG/book
Rafael Harth (8mo):
Do you think this describes language models?

but I am surprised that Chinchilla's curves use an additive term that predicts that loss will never go below 1.69. What happened with the claims that ideal text-prediction performance was like 0.7?

Apples & oranges, you're comparing different units. Comparing token perplexities is hard when the tokens (not to mention datasets) differ. Chinchilla isn't a character-level model but BPEs (well, they say SentencePiece which is more or less BPEs), and BPEs didn't even exist until the past decade so there will be no human estimates which are in BPE units (... (read more)
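To make the unit mismatch concrete: a loss reported per BPE token has to be converted through the tokenizer's characters-per-token ratio before it can be compared with human bits-per-character estimates. A rough sketch, assuming the loss is in nats per token and using ~4 characters per token as a common English-BPE rule of thumb (not Chinchilla's exact figure):

```python
import math

def nats_per_token_to_bits_per_char(loss_nats_per_token: float,
                                    chars_per_token: float = 4.0) -> float:
    """Convert a per-BPE-token loss in nats to bits per character."""
    bits_per_token = loss_nats_per_token / math.log(2)
    return bits_per_token / chars_per_token

# e.g. an additive term of 1.69 nats/token works out to roughly:
print(nats_per_token_to_bits_per_char(1.69))   # ~0.6 bits/character
```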

Daniel Kokotajlo (8mo):
Thanks Lanrian and Gwern! Alas that my quick-and-dirty method is insufficient.
Daniel Kokotajlo (8mo):
You may be interested in this image [https://www.lesswrong.com/posts/YzbQeCiwoLBHrvAh4/?commentId=CpX9CqMgt5K4mjnmv#NYGFQ5K4m7LJt4oJ9] . I would be grateful for critiques; maybe I'm thinking about it wrong?
Daniel Kokotajlo (8mo):
So then... If before we looked at the Kaplan scaling and thought e.g. 50% chance that +6 OOMs would be enough... now we correct for the updated scaling laws and think 50% chance that, what, +4 OOMs would be enough? How big do you think the adjustment would be? (Maybe I can work it out by looking at some of those IsoX graphs in the paper?)

That's my read. It continues the Kaplan scaling. The Kaplan scaling isn't wrong (everything really does scale that way if you train that way), it's just suboptimal. PaLM is not a surprise, neither in the compute cost nor in having capability-spikes (at least, if you've been paying attention and not handwaving them away).

The surprise here is perhaps showing how bad GB/DM communications are, that DM may have let GB piss away millions of dollars of TPU time. As one Googler put it, 'we find out about this stuff the same way you do - from Twitter'.

Daniel Kokotajlo (8mo):
The difference between Chinchilla and Gopher was small but noticeable. Since the Kaplan and DM optimal scaling trajectories are like two lines with different slopes, should we perhaps expect the difference to get larger at greater scales?

Worth remembering that flips of the reward function do happen: https://openai.com/blog/fine-tuning-gpt-2/#bugscanoptimizeforbadbehavior ("Was this a loss to minimize or a reward to maximize...")

I agree with your framing, and I think it shows Paul is wrong, leaving aside the specifics of the cheetah thing. Looking back, humans pursued both paths, the path of selecting cheetahs (horses) and of using G to look for completely different paradigms that blow away cheetahs. (Since we aren't evolution, we aren't restricted to picking just one approach.) And we can see the results today: when was the last time you rode a horse?

If you had invested in 'the horse economy' a century ago and bought the stock of bluechip buggywhip manufacturers instead of aerosp... (read more)

Paul Christiano (9mo):
I absolutely agree that there are usually multiple ways to do something, often one of them improves faster than current SOTA, and that the faster one often overtakes the slower improving one. I may be misunderstanding what you are taking away from the horses analogy. I don't think this undermines my point (or at least I don't yet see the connection).

That said, since I can't resist responding to random comments: are horses really being bred for sprinting as fast as they can for 20-30 seconds?

Yes, they were, and they still are. Cavalry charges are not that long*, and even if you want to absurdly nitpick on this exact basis where 20-30 seconds counts but 30-40s doesn't, well, as it happens, 20-30s is exactly about how long quarter horse races last. (Quarter horses, incidentally, now reach the low end of cheetah top speeds: 55mph, vs ~60mph. So depending on which pair of horses & cheetahs you compa... (read more)

Paul Christiano (9mo):
I was saying that natural selection is not a human investor and behaves differently, responding to Eliezer saying "not as a metaphor but as simple historical fact, that's how it played out." I'm sorry if the exchange was unclear (but hopefully not surprising since it was a line of chat in a fast dialog written in about 3 seconds.) I think that you have to make an analogy because the situation is not obviously structurally identical and there are different analogies you could draw here and it was not clear which one he was making. I'm sorry I engaged about horse breeding (I think it was mostly a distraction).

They apparently reinvented RASP independently.

Vivek Hebbar (9mo):
Nice! Do you know if the author of that post was involved in RASP?

as late as 1000 years ago, the fastest creatures on Earth are not humans, because you need even more G than that to go faster than cheetahs

...as a matter of fact there is no one investing in making better cheetahs

A little puzzled by Paul's history here. Humans invested exorbitant amounts of money and effort into making better cheetahs, in the sense of 'trying to be able to run much faster and become the fastest creatures on earth'; we call those manufactured cheetahs, "horses". For literally thousands of years, breeding horses has been a central preo... (read more)

I don't think this is relevant to the disanalogy I was trying to make, which was between natural selection and investors. It seems like I'm thinking about the comparison in a different way here. Hopefully this explains your puzzlement.

That said, since I can't resist responding to random comments: are horses really being bred for sprinting as fast as they can for 20-30 seconds? (Isn't that what cheetahs are so good at?) What is the military/agricultural/trade context in which that is relevant? Who cares other than horse racers? Over any of the distances where people are using horses I would expect them to be considerably faster than cheetahs even if both are unburdened. I don't know much about horses though.

Humans invested exorbitant amounts of money and effort into making better cheetahs, in the sense of 'trying to be able to run much faster and become the fastest creatures on earth'; we call those manufactured cheetahs, "horses".

I don't think Paul is talking about that. Consider the previous lines (which seem like they could describe animal breeding to me):

and you think that G doesn't help you improve on muscles and tendons?

until you have a big pile of it?

and Eliezer's response in the following lines:

the natural selection of cheetahs is investing in it

it's

... (read more)

Aligned AI is a benefit corporation dedicated to solving the alignment problem

Is this a UK or US public-benefit corporation?

Who are the other founders?

Who and how much are you capitalized for?

Stuart Armstrong (9mo):
UK-based currently; Rebecca Gorman is the other co-founder.
Evan R. Murphy (10mo):
This page [ https://buildaligned.ai/get-involved/] says "We are located in Oxford, England." So I think they are a UK public-benefit corporation, but I could be mistaken.

A possible way this solution wouldn’t hold is if we consider the case of a compressed lookup table. Meta-learning could have a shorter description length because it compresses knowledge as opposed to a classical lookup table which adds a new member for each new corresponding input. A compressed lookup table could potentially have a shorter description length than GLUT, even one that grows logarithmically, implying that the speed prior wouldn’t necessarily favour meta-learning, but this requires further investigation.

Why isn't the learned 'compression' a... (read more)

It's worth noting that aside from the ridiculous situation where Googlers aren't allowed to name LaMDA (despite at least 5 published papers so far), Google has been very coy about MUM & Pathways (to the point where I'm still not sure if 'Pathways' is an actual model that exists, or merely an aspirational goal/name of a research programme). You also have the situation where models like LG's new 300b Exaone is described in a research paper which makes no mention of Exaone (the Korean coverage briefly mentions the L-Verse arch, but none of the English cov... (read more)

Edouard Harris (1y):
This is an excellent point and it's indeed one of the fundamental limitations of a public tracking approach. Extrapolating trends in an information environment like this can quickly degenerate into pure fantasy. All one can really be sure of is that the public numbers are merely lower bounds — and plausibly, very weak ones.

Why do you have high confidence that catastrophic forgetting is immune to scaling, given "Effect of scale on catastrophic forgetting in neural networks", Anonymous 2021?


My Interpretation: We perform SGD updates on parameters while training a model. The claim is that the decision boundary does not change dramatically after an update. The safety implication is that we need not worry about an advanced AI system manoeuvring from one strategy to a completely different kind after an SGD update.

I also disagree with B1, gradual change bias. This seems to me t... (read more)

If you said "actually people will be using methods that flounder at a compute budget of 1e25 flops, but people will be doing AGI research with 1e30 flops, and the speedup will be > 1 OOM" then I agree that will give you a hard takeoff, but that's what I'm saying transformers aren't a good example of.

Why not? Here we have a pretty clean break: RNNs are not a tweak or two away from Transformers. We have one large important family of algorithms, which we can empirically demonstrate do not absorb usefully the compute which another later discretely differ... (read more)

The counterfactual depends on what other research people would have done and how successful it would have been. I don't think you can observe it "by simply looking."

That said, I'm not quite sure what counterfactual you are imagining. By the time transformers were developed, soft attention in combination with LSTMs was already popular. I assume that in your counterfactual soft attention didn't ever catch on? Was it proposed in 2014 but languished in obscurity and no one picked it up? Or was sequence-to-sequence attention widely used, but no one ever conside... (read more)

Eliezer Yudkowsky (1y):
Want to +1 that a vaguer version of this was my own rough sense of RNNs vs. CNNs vs. Transformers.

And I’d reject LSTM → transformer or MoE as an example because the quantitative effect size isn’t that big.

But if something like that made the difference between “this algorithm wasn’t scaling before, and now it’s scaling,” then I’d be surprised.

Hold on, why doesn't LSTM→Transformer count? You've basically never seen a LSTM RNN larger than 100m parameters, I think, and the reason is that their scaling exponent looks bad and past 100m they're floundering: https://www.gwern.net/images/ai/gpt/2020-kaplan-figure7-rnnsvstransformers.png (Kaplan) Or https://a... (read more)
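For intuition on what a worse scaling exponent means in practice, here is a toy pair of power-law loss curves (the coefficients and exponents below are invented for illustration, not Kaplan et al's fitted values): the curve with the shallower exponent looks fine at small sizes and then "flounders", while the steeper one overtakes it and keeps improving.

```python
import numpy as np

params = np.logspace(6, 9, 4)               # 1M .. 1B parameters
loss_a = 12.0 * params ** -0.05             # shallow exponent: poor returns to scale
loss_b = 20.0 * params ** -0.08             # steeper exponent: keeps improving

for n, la, lb in zip(params, loss_a, loss_b):
    print(f"N={n:.0e}  shallow-exponent={la:.2f}  steep-exponent={lb:.2f}")
```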

Copy-pasting the transformer vs LSTM graph for reference (the one with the bigger gap):

If you told me that AGI looks like that graph, where you replace "flounders at 100M parameters" with "flounders at the scale where people are currently doing AGI research," then I don't think that's going to give you a hard takeoff.

If you said "actually people will be using methods that flounder at a compute budget of 1e25 flops, but people will be doing AGI research with 1e30 flops, and the speedup will be > 1 OOM" then I agree that will give you a hard takeoff, but t... (read more)

One thing to note is that you don't know how many games humans are playing in their head in some sense. We don't have access to that kind of information about our own algorithms. Even if you think we don't because we don't consciously experience/remember them, that's obviously wrong. Every time you have a thought pop out of nowhere or an eureka! moment from the incubation effect, or every time you have a Tetris effect dream (or all the experience-replay hippocampus neuroscience), you see how it feels to have powerful subconscious algorithms churning away o... (read more)

Steve Byrnes (1y):
That's an interesting thought. My hunch is that hippocampal replay can't happen unconsciously because if the hippocampus broadcasts a memory at all, it broadcasts it broadly to the cortex including GNW [https://www.alignmentforum.org/posts/x4n4jcoDP7xh5LWLq/book-summary-consciousness-and-the-brain]. That's just my current opinion, I'm not sure if there's neuroscience consensus on that question.

Here I'm sneaking in an assumption that "activity in the GNW" = "activity that you're conscious of". Edge-cases include times when there's stuff happening in the GNW, but it's not remembered after the fact (at least, not as a first-person episodic memory). Are you "conscious" during a dream that you forget afterwards? Are you "conscious" when you're 'blacked out' from drinking too much? I guess I'd say "yes" to both, but that's a philosophy question, or maybe just terminology.

If we want more reasons that human-vs-EfficientZero comparisons are not straightforward, there's also the obvious fact that humans benefit from transfer-learning whereas EfficientZero starts with random weights.

It's a little bit less dramatic than that: the model-based simulation playing is interleaved with the groundtruth environment. It's more like you spend a year playing games in your head, then you play 1 30s bullet chess match with Magnus Carlsen (made-up ratio), then go back to playing in your head for another year. Or maybe we should say, "you clone yourself a thousand times, and play yourself at correspondence chess timescales for 1 game per pair in a training montage, and then go back for a rematch".

(The scenario where you play for 15 minutes at the begi... (read more)

Rohin Shah (1y):
Also Against Mimicry [https://ai-alignment.com/against-mimicry-6002a472fc42]

What Moravec says is merely that $1k human-level compute will become available in the '2020s', and offers several different trendline extrapolations: only the most aggressive puts us at cheap human-level compute in 2020/2021 (note the units on his graph are in decades). On the other extrapolations, we don't hit cheap human-compute until the end of the decade. He also doesn't commit to how long it takes to turn compute into powerful systems, it's more of a pre-requisite: only once the compute is available can R&D really start, same way that DL didn't start instantly in 2010 when various levels of compute/$ were hit. Seeds take time to sprout, to use his metaphor.

Eliezer Yudkowsky (1y):
As much as Moravec-1988 and Moravec-1998 sound like they should be basically the same people, a decade passed between them, and I'd like to note that Moravec may legit have been making an updated version of his wrong argument in 1998 compared to 1988 after he had a chance to watch 10 more years pass and make his earlier prediction look less likely.
Vanessa Kosoy (1y):
We already know how much compute we have, so we don't need Moravec's projections for this? If Yudkowsky described Moravec's analysis correctly, then Moravec's threshold was crossed in 2008. Or, by "other extrapolations" you mean other estimates of human brain compute? Cotra's analysis is much more recent and IIUC she puts the "lifetime anchor" (a more conservative approach than Moravec's) at about one order of magnitude above the biggest models currently used. Now, the seeds take time to sprout, but according to Mark's model this time is quite short. So, it seems like this line of reasoning produces a timeline significantly shorter than the Plattian 30 years.

What we'd want is some neural-net style design that generates the coin reward and the move-right reward just from the game data, without any previous knowledge of the setting.

So you're looking for curriculum design/exploration in meta-reinforcement-learning? Something like Enhanced POET/PLR/REPAIRED but where it's not just moving-right but a complicated environment with arbitrary reward functions (eg. using randomly initialized CNNs to map state to 'reward')? Or would hindsight or successor methods count as they relabel rewards for executed trajectories... (read more)

Stuart Armstrong (8mo):
Hey there! Sorry for the delay. $50 awarded to you for fastest good reference. PM me your bank details.

What do you think of Deepmind's new whoop-de-doo about doing research-level math assisted by GNNs?

Paul Christiano (1y):
Not surprising in any of the ways that good IMO performance would be surprising.