Do you know why 4x was picked? I understand that doing evals properly is a pretty substantial effort, but once we get up to gigantic sizes and proto-AGIs it seems like it could hide a lot. If there was a model sitting in training with 3x the train-compute of GPT4 I'd be very keen to know what it could do!

Hi Scott, thanks for this!

Yes I did do a fair bit of literature searching (though maybe not enough tbf) but very focused on sparse coding and approaches to learning decompositions of model activation spaces rather than approaches to learning models which are monosemantic by default which I've never had much confidence in, and it seems that there's not a huge amount beyond Yun et al's work, at least as far as I've seen.

Still though, I've not seen almost any of these which suggests a big hole in my knowledge, and in the paper I'll go through and add a lot more background to attempts to make more interpretable models.

Hi Charlie, yep it's in the paper - but I should say that we did not find a working CUDA-compatible version and used the scikit version you mention. This meant that the data volumes used are somewhat limited - still on the order of a million examples but 10-50x less than went into the autoencoders.

It's not clear whether the extra data would provide much signal since it can't learn an overcomplete basis and so has no way of learning rare features but it might be able to outperform our ICA baseline presented here, so if you wanted to give someone a project of making that available, I'd be interested to see it!

4mo42

Try decomposing the residual stream activations over a batch of inputs somehow (e.g. PCA). Using the principal directions as activation addition directions, do they seem to capture something meaningful?

It's not PCA but we've been using sparse coding to find important directions in activation space (see original sparse coding post, quantitative results, qualitative results).

We've found that they're on average more interpretable than neurons and I understand that @Logan Riggs and Julie Steele have found some effect using them as directions for activation pat...

Compute Thresholds: proposed rules to mitigate risk of a “lab leak” accident during AI training runs

4mo31

For the avoidance of doubt, this accounting should recursively aggregate transitive inputs.

What does this mean?

34mo

Suppose Training Run Z is a finetune of Model Y, and Model Y was the output of Training Run Y, which was already a finetune of Foundation Model X produced by Training Run X (all of which happened after September 2021). This is saying that not only Training Run Y (i.e. the compute used to produce one of the inputs to Training Run Z), but also Training Run X (a “recursive” or “transitive” dependency), count additively against the size limit for Training Run Z.

6mo10

Do you have a writeup of the other ways of performing these edits that you tried and why you chose the one you did?

In particular, I'm surprised by the method of adding the activations that was chosen because the tokens of the different prompts don't line up with each other in a way that I would have thought would be necessary for this approach to work, super interesting to me that it does.

If I were to try and reinvent the system after just reading the first paragraph or two I would have done something like:

- Take
*multiple*pairs of prompts that differ primari

39mo

Just added some more detail on this to the slides. The idea is that we have various advantages over the model during the training process: we can restart the search, examine and change beliefs and goals using interpretability techniques, choose exactly what data the model sees, etc.

Agree that the cited links don't represent a strong criticism of RLHF but I think there's an interesting implied criticism, between the mode-collapse post and janus' other writings on cyborgism etc that I haven't seen spelled out, though it may well be somewhere.

I see janus as saying that if you know how to properly use the raw models, then you can actually get much more useful work out of the raw models than the RLHF'd ones. If true, we're paying a significant alignment tax with RLHF that will only become clear with the improvement and take-up of wrappers...

10mo10

Interesting! I'm struggling to think what kind of OOD fingerprints for bad behaviour you (pl.) have in mind, other than testing fake 'you suddenly have huge power' situations which are quite common suggestions but v curious what you have in mind.

Also, think it's worth saying that the strength of the result connecting babbage to text-davinci-001 is stronger than that connecting ada to text-ada-001 (by logprob), so it feels like the first one shouldn't count that as a solid success.

I wonder whether you'd find a positive rather than negative correlation...

110mo

I would guess it's positive. I'll check at some point and let you know.

10mo10

I wanted to test out the prompt generation part of this so I made a version where you pick a particular input sequence and then only allow a certain fraction of the input tokens to change. I've been initialising it with a paragraph about COVID and testing how few tokens it needs to be able to change before it reliably outputs a particular output token.

Turns out it only needs a few tokens to fairly reliably force a single output, even within the context of a whole paragraph, eg *"typical people infected Majesty the virus will experience mild to moderate 74 i*...

I'd be interested to know how you estimate the numbers here, they seem quite inflated to me.

If 4 big tech companies were to invest $50B each in 2023 then, assuming average salary as $300k and 2:1 capital to salary then investment would be hiring about 50B/900K = 55,000 people to work on this stuff. For reference the total headcount at these orgs is roughly 100-200K.

50B/yr is also around 25-50% of the size of the total income, and greater than profits for most which again seems high.

Perhaps my capital ratio is way too low but I would find it hard to believe...

Yeah I agree demonstrating it and examining it is very important, especially in chain-of-thought where the existence of hidden information isn't a certainty.

They're a bit out of date but the original proposal is ELK Sub - Note-taking in internal rollouts and the multiple model protocol is in Note Taking Without Hidden Messages.

Haven't done any proper testing but it's high up on my agenda, would be interested in working out what tests would be best - will elaborate later.

1y41

For the ELK contest I was working on something similar, but where the problem is more acute (you're trying to force a MuZero-style reinforcement learner to compress its hidden state into words).

The methods I proposed to deal with this are:

- Cut down the null space by training a language model to rewrite the written words in a different format which preserves the meaning as closely as possible, so that the CoT reasoner is forced to respond to similar sentences in similar ways.
- Use multiple models which are trained separately but have to respond to each others'

31y

Putting what I think is the most important part of my reply first: I think research into mitigations is premature and instead demonstrating/measuring the phenomena should take priority.
However given that, I think I agree that these are all possible mitigations to the phenomena, in particular (rephrasing your points):
* Reducing the channel capacity, and thus the amount of information that can be transmitted
* Constraining the outputs, and thus reducing the degrees of freedom of the null space
* Adding noise (via dropout or otherwise) also reduces the channel capacity, and probably also has some desirable side effects (like increasing the diversity or creativity of reasoning)
* Introducing "barriers" past which optimization for hidden information becomes difficult (in your example, you use different models; I also think "stop gradients" in end-to-end training has similar effects, if weaker)
* If you can detect/localize this phenomena, you might be able to use interpretability like techniques to limit or constrain it
* I'm less certain about this, but possible future interpretability research will unlock this ability.
Also I think an important point here is that we can expect to see it with fairly simple techniques -- merely including sampled/generated data in the training set is sufficient, as opposed to it requiring a complicated reinforcement learning algorithm like MuZero.
Did you publish your proposal? I'd be interested in reading it.

1y31

I'm really interested to see this progress, it would feel very healthy if we could have a solid integrated definition of optimizer to work with.

I'm not sure I understand why you don't agree with the 'small' criterion for the target set. It seems that you should be able to say something about the likelihood of the target in the absence of any agent (or if the agent takes a max-ent distribution over actions or something), and that's the relevant notion of smallness, which then becomes large in the presence of the agent. Or is it that you expect it to be diff...

21y

Thank you for this. Yes, the problem is that (in some cases) we think it can sometimes be difficult to specify what the probability distribution would be without the agent. One strategy would be to define some kind of counterfactual distribution that would obtain if there were no agent, but then we need to have some principled way to get this counterfactual (which might be possible). I think this is easier in situations in which the presence of an agent/optimizer is only one possibility, in which case we have a defined probability distribution, conditional on there not being an agent. Perhaps that is all that matters (I am somewhat partial to this), but then I don't think of this as giving us a definition of an optimizing system (since, conditional on their being an optimizing system, there would cease to be an optimizing system---for a similar idea, see Vingean Agency).
I like your suggestions for connecting (1) and (3).
And thanks for the correction!

Suggestion:

Eliezer has huge respect in the community; he has strong, well thought-out opinions (often negative) on a lot of the safety research being done (with exceptions, Chris Olah mentioned a few times); but he's not able to work full time on research directly (or so I understand, could be way off).

Perhaps he should institute some kind of prize for work done, trying to give extra prestige and funding to work going in his preferred direction? Does this exist in some form without my noticing? Is there a reason it'd be bad? Time/energy usage for Eliezer combined with difficulty of delegation?

2y5

The Research Engineer job for the Alignment team is no longer open - is this because it's reached some threshold of applications? In any case might not be helpful to advertise!

Thanks for doing this though, the context is very useful (I've applied as RE to both).

22y

Should be fixed now!

1[comment deleted]2y

2y130

The synthesis of these options would be an AGI research group whose plan consists of:

- Develop safe AGI.
- Try to convince world governments to perform some such pivotal act (Idea A) - note that per current institutions this needs consensus and strong implementation across all major and medium tech powers.
- Have a back-up plan, if AGI research is proliferating without impending shutdown, to shut down world research unilaterally (Idea B).

What do you think of such a plan?

I think this would be reasonable, but if the plan is taken up then it becomes a cost-benefit an...

2y1

I mentioned it in my standalone post but I'll register a question here:

In the counterexamples for 'Strategy: train a reporter that is useful for another AI', the main difficulty is the ability for agents to hide information in human language somehow, given the many available degrees of freedom.

I grant that this is a big risk but one advantage we have is that if we trained multiple agents, they would all be encoding hidden information, but most likely they would all encode this extra information in different ways.

The question is, given multiple agents encod...

2y5

I wonder if the discussion of the scientific capabilities of e.g. GPT-3 would be more productive if it were anchored to some model of the wider scientific feedback loop in which it's situated?

Consider three scenarios:

**A**: A model trained to predict the shapes of proteins from their DNA sequences.**B:**A model with access to custom molecule synthesizer and high-volume, high-quality feedback about the scientific value of its text production, trained to write papers with scientific value about the nature of the molecules produced.**C:**A model given the resources of

Turns out our methods are not actually very path-dependent in practice!

Yeah I get that's what Mingard et al are trying to show but the meaning of their empirical results isn't clear to me - but I'll try and properly read the actual paper rather than the blog post before saying any more in that direction.

..."Flat minimum surrounded by areas of relatively good performance" is

synonymous withcompression. if we can vary the parameters in lots of ways without losing much performance, thatimplies thatall the info needed for optimal performance has been compresse

2y3

Cheers for posting! I've got a question about the claim that optimizers compress by default, due to the entropy maximization-style argument given around 20:00 (apologies if you covered this, it's not easy to check back through a video):

Let's say that we have a neural network of width 100, which is trained on a dataset which could be trained to perfect accuracy on a network of width of only 30. If it compresses it into only 30 weights there's a 70-dimensional space of free parameters and we should expect a randomly selected solution to be of this kind. ...

32y

This is where Mingard et al come in. One of their main results is that SGD training on neural nets does quite well approximate just-randomly-sampling-an-optimal-point. Turns out our methods are not actually very path-dependent in practice!
There is a mismatch between your intuition and the implications of "flat minima surrounded by areas of relatively good performance".
Remember, the whole point of the "highly compressed arrangements" is that we only need to lock in a few parameter values in order to get optimal behavior; once those few values are locked in, the rest of the parameters can mostly vary however they want without screwing stuff up. "Flat minimum surrounded by areas of relatively good performance" is synonymous with compression: if we can vary the parameters in lots of ways without losing much performance, that implies that all the info needed for optimal performance has been compressed into whatever-we-can't-vary-without-losing-performance.
Now, your intuition is correct in the sense that info may be spread over many parameters; the relevant "ways to vary things" may not just be "adjust one param while holding others constant". For instance, it might be more useful to look at parameter variation along local eigendirections of the Hessian. Then the claim would be something like "flat optimum = performance is flat along lots of eigendirections, therefore we can project the parameter-values onto the non-flat eigendirections and those projections are the 'compressed info'". (Tbc, I still don't know what the best way is to characterize this sort of thing, but eigendirections are an obvious approximation which will probably work.)

I see John agrees with the 'one-time' label but it seems a bit too strong to me, especially if the kind of optimization is 'lets try a totally different approach', rather than continuing to train the same system, or focusing on exactly why it spoofed one sensor but not the other. Just to think it through:

There are three types of system that are important: type A which fails on the validation/holdout data, type B which succeeds on validation but not test/real-world data, and type C, which succeeds on both. We are looking for type C, and we use the validatio...

52y

I agree. There's nothing magical about "once". I almost wrote "once or twice", but it didn't sit well with the level of caution I would prefer be the norm. While your analysis seems correct, I am worried if that's the plan.
I think a safety team should go into things with the attitude that this type of thing is important a last-line-of-defense, but should never trigger. The plan should involve a strong argument that what's being build is safe. In fact if this type of safeguard gets triggered, I would want the policy to be to go back to the drawing board, take the new information into account, and come up with a more well-argued plan. The new plan can have "never to be used" safeguards like this, but hopefully it has more and different ones this time.
If, on the other hand, a safety team goes in with the idea that John's safeguard can be iterated a few times as you argue, then I anticipate them fooling themselves by iterating too many times and coming up with a plan that accidentally skirts the safeguard in some hard-to-notice way.
(I have no reason to expect these sorts of anticipations to be calibrated; I'm just thinking cautiously here.)

I was going to write something similar, and just wanted to add that this problem can be expected to get worse the more non-holdout sensors there are. If there were just a single non-holdout camera then spoofing only the one camera would be worthwhile - but if there were a grid of cameras with just a few being held out then it would likely be easiest to take an action that fools them all, like a counterfeit diamond.

This method would work best when there be whole *modes* of data which are ignored, and the work needed to spoof them is orthogonal to all non-holdout modes.

I've been looking at papers involving a lot of 'controlling for confounders' recently and am unsure about how much weight to give their results.

Does anyone have recommendations about how to judge the robustness of these kind of studies?

Also, I was considering doing some tests of my own based on random causal graphs, testing what happens to regressions when you control for a limited subset of confounders, varying the size/depth of graph and so on. I can't seem to find any similar papers but I don't know the area, does anyone know of similar work?

2y3

This employee has 100 million dollars, approximately 10,000x fewer resources than the hedge fund. Even if the employee engaged in unethical business practices to achieve a 2x higher yearly growth rate than their former employer, it would take 13 years for them to have a similar amount of capital.

I think it's worth being explicit here about whether increases in resources under control are due to appreciation of existing capital or allocation of new capital.

If you're talking about appreciation, then if the firm earns 5% returns on average and the rogue...

2y1

Cheers for the post, I find the whole series fascinating.

One thing I was particularly curious about is how these 'proposals' are made. Do you have a picture of what kind of embedding is used to present a potential action?

For example, is a proposal encoded in the activations of set of neurons that are isomorphic to the motor neurons and it could then propose tightening a set of finger muscles through specific neurons? Or is the embedding jointly learned between the two in some large unstructured connection, or smaller latent space, or something completely different?

32y

The least-complicated case (I think) is: I (tentatively) think that the hippocampus is more-or-less a lookup table with a finite number of discrete thoughts / memories / locations / whatever (the type of content in different in different species), and a "proposal" is just "which of the discrete things should be activated right now".
A medium-difficulty case is: I think motor cortex stores a bunch of sequences of motor commands which execute different common action sequences. (I'm a believer in the Graziano theory that primary motor cortex, secondary motor cortex, supplementary motor cortex, etc. etc., are all doing the same kind of thing and should be lumped together.) The exact details of the data structures that the brain uses to store these sequences of motor commands are controversial and I don't want to get into it here…
Then the hardest case is the areas that "think thoughts", spawn new ideas, etc., all the cool stuff that leads to human intelligence. (e.g. dorsolateral prefrontal cortex I think.) Things like "I'm going to go to the store" or "what if I differentiate both sides of the equation?". Those things are clearly not isomorphic to a sequence of motor commands. It's higher-level than that. Again, the exact data structures and algorithms involved in representing and searching for these "thoughts" is a very big and controversial topic that I don't want to get into here…

Another little update, speed issue solved for now by adding SymPy's fortran wrappers to the derivative calculations - calculating the SVD isn't (yet?) the bottleneck. Can now quickly get results from 1,000+ step simulations of 100s of particles.

Unfortunately, even for the pretty stable configuration below, the values are indeed exploding. I need to go back through the program and double check the logic but I don't think it should be chaotic, if anything I would expect the values to hit zero.

It might be that there's some kind of quasi-chaotic behaviou...

22y

If the wheels are bouncing off each other, then that could be chaotic in the same way as billiard balls. But at least macroscopically, there's a crapton of damping in that simulation, so I find it more likely that the chaos is microscopic. But also my intuition agrees with yours, this system doesn't seem like it should be chaotic...

Been a while but I thought the idea was interesting and had a go at implementing it. Houdini was too much for my laptop, let alone my programming skills, but I found a simple particle simulation in pygame which shows the basics, can see below.

Planned next step is to work on the run-time speed (even this took a couple of minutes run, calculating the frame-to-frame Jacobian is a pain, probably more than necessary) and then add some utilities for creatin...

33y

Nice!
A couple notes:
* Make sure to check that the values in the jacobian aren't exploding - i.e. there's not values like 1e30 or 1e200 or anything like that. Exponentially large values in the jacobian probably mean the system is chaotic.
* If you want to avoid explicitly computing the jacobian, write a method which takes in a (constant) vector u and uses backpropagation to return ∇x0(xt⋅u). This is the same as the time-0-to-time-t jacobian dotted with u, but it operates on size-n vectors rather than n-by-n jacobian matrices, so should be a lot faster. Then just wrap that method in a LinearOperator (or the equivalent in your favorite numerical library), and you'll be able to pass it directly to an SVD method.
In terms of other uses... you could e.g. put some "sensors" and "actuators" in the simulation, then train some controller to control the simulated system, and see whether the data structures learned by the controller correspond to singular vectors of the jacobian. That could make for an interesting set of experiments, looking at different sensor/actuator setups and different controller architectures/training schemes to see which ones do/don't end up using the singular-value structure of the system.

3y4

Reading this after Steve Byrnes' posts on neuroscience gives a potentially unfortunate view on this.

The general impression is that the a lot of our general understanding of the world is carried in the neocortex which is running a consistent statistical algorithm and the fact that humans converge on similar abstractions about the world could be explained by the statistical regularities of the world as discovered by this system. At the same time, the other parts of the brain have a huge variety of structures and have functions which are the products of evolu...

53y

Here's one fairly-standalone project which I probably won't get to soon. It would be a fair bit of work, but also potentially very impressive in terms of both showing off technical skills and producing cool results.
Short somewhat-oversimplified version: take a finite-element model of some realistic objects. Backpropagate to compute the jacobian of final state variables with respect to initial state variables. Take a singular value decomposition of the jacobian. Hypothesis: the singular vectors will roughly map to human-recognizable high-level objects in the simulation (i.e. the nonzero elements of any given singular vector should be the positions and momenta of each of the finite elements comprising one object).
Longer version: conceptually, we imagine that there's some small independent Gaussian noise in each of the variables defining the initial conditions of the simulation (i.e. positions and momenta of each finite element). Assuming the dynamics are such that the uncertainty remains small throughout the simulation - i.e. the system is not chaotic - our uncertainty in the final positions is then also Gaussian, found by multiplying the initial distribution by the jacobian matrix. The hypothesis that information-at-a-distance (in this case "distance" = later time) is low-dimensional then basically says that the final distribution (and therefore the jacobian) is approximately low-rank.
In order for this to both work and be interesting, there are some constraints on both the system and on how the simulation is set up. First, "not chaotic" is a pretty big limitation. Second, we want the things-simulated to not just be pure rigid-body objects, since in that case it's pretty obvious that the method will work and it's not particularly interesting. Two potentially-interesting cases to try:
* Simulation of an elastic object with multiple human-recognizable components, with substantial local damping to avoid small-scale chaos. Cloth or jello or a sticky hand or someth

I agree that this is the biggest concern with these models, and the GPT-n series running out of steam wouldn't be a huge relief. It looks likely that we'll have the first human-scale (in terms of parameters) NNs before 2026 - Metaculus, 81% as of 13.08.2020.

Does anybody know of any work that's analysing the rate at which, once the first NN crosses the n-parameter barrier, other architectures are also tried at that scale? If no-one's done it yet, I'll have a look at scraping the data from Papers With Code's databases on e.g. I...

3y4

Hey Daniel, don't have time for a proper reply right now but am interested in talking about this at some point soon. I'm currently in UK Civil Service and will be trying to speak to people in their Office for AI at some point soon to get a feel for what's going on there, perhaps plant some seeds of concern. I think some similar things apply.

13y

Sure, I'd be happy to talk. Note that I am nowhere near the best person to talk to about this; there are plenty of people who actually work at an AI project, who actually talk to AI scientists regularly, etc.

I think this this points to the strategic supremacy of relevant infrastructure in these scenarios. From what I remember of the battleship era, having an advantage in design didn't seem to be a particularly large advantage - once a new era was entered, everyone with sufficient infrastructure switches to the new technology and an arms race starts from scratch.

This feels similar to the AI scenario, where technology seems likely to spread quickly through a combination of high financial incentive, interconnected social networks, state-sponsored espionage e...

4y7

Apologies if this is not the discussion you wanted, but it's hard to engage with comparability classes without a framework for how their boundaries are even minimally plausible.

Would you say that all types of discomfort are comparable with higher quantities of themselves? Is there always a marginally worse type of discomfort for any given negative experience? So long as both of these are true (and I struggle to deny them) then transitivity seems to connect the entire spectrum of negative experience. Do you think there is a way to remove the transitivity of comparability and still have a coherent system? This, to me, would be the core requirement for making dust specks and torture incomparable.

34y

I agree that delineating the precise boundaries of comparability classes is a uniquely challenging task. Nonetheless, it does not mean they don't exist--to me your claim feels along the same lines as classical induction "paradoxes" involving classifying sand heaps. While it's difficult to define exactly what a sand heap is, we can look at many objects and say with certainty whether or not they are sand heaps, and that's what matters for living in the world and making empirical claims (or building sandcastles anyway).
I suspect it's quite likely that experiences you may be referring to as "higher quantities of themselves" within a single person are in fact qualitatively different and no longer comparable utilities in many cases. Consider the dust specks: they are assumed to be minimally annoying and almost indetectable to the bespeckèd. However, if we even slightly upgrade them so as to cause a noticeable sting in their targeted eye, they appear to reach a whole different level. I'd rather spend my life plagued by barely noticeable specks (assuming they have no interactions) than have one slightly burn my eyeball.

I've realised that you've gotta be careful with this method because when you find a trichromatic subtriangle of the original, it won't necessarily have the property of only having points of two colours along the edges, and so may not in fact contain a point that maps to the centre.

This isn't a problem if we just increase the number n by which we divide the whole triangle instead of recursively dividing subtriangles. Unfortunately now we're not reducing the range of co-ords where this fixed point must be, only finding a triad of ar...

35y

Yeah, you're right. That breaks the proof. I don't know how to deal with it yet.

5y7

Cleanest solution I can find for #8:

Also, if we have a proof for #6 there's a pleasant method for #7 that should work in any dimension:

We take our close*d convex set tha*t has the bounded function . We take a triangle that covers so that any point in is also in .

Now we define a new function such that where is the function that maps to the nearest point in .

By #6 we know that has a fixed point, since is continuous. We know that the fixed point of cannot lie outside because th...

5y3

On my approach:

I constructed a large triangle around the convex shape with the center somewhere in the interior. I then projected each point in the convex shape from the center towards the edge of the triangle in a proportional manner. ie. The center stays where it is, the points on the edge of the convex shape are projected to the edge of the triangle and a point 1/x of the distance from the center to the edge of the convex shape is 1/x of the distance from the center to the edge of the triangle.

Yeah agreed, in fact I don't think you even need to continually bisect, you can just increase n indefinitely. Iterating becomes more dangerous as you move to higher dimensions because an n dimensional simplex with n+1 colours that has been coloured according to analogous rules doesn't necessarily contain the point that maps to zero.

On the second point, yes I'd been assuming that a bounded function had a bounded gradient, which certainly isn't true for say sin(x^2), the final step needs more work, I like the way you did it in the proof below.

25y

Here's a messy way that at least doesn't need too much exhaustive search:

First let's separate all of the red nodes into groups so that within each group you can get to any other node in that group only passing through red nodes, but not to red nodes in any other group.

Now, we trace out the paths that surround these groups - they immediately look like the paths from Question 1 so this feels like a good start. More precisely, we draw out the paths such that each vertex forms one side of a triangle that has a blue node at its opposite corner. ...

I was able to get at least (I think) close to proving 2 using Sperner's Lemma as follows:

You can map the continuous function f(x) to a path of the kind found in Question 1 of length n+1

by evaluating f(x) at x=0, x=1 and n-1 equally spaced divisions between these two points and setting a node as blue if f(x) < 0 else as green.

By Sperner's Lemma there is an odd, and therefore non-zero number of b-g vertices. You can then take any b-g pair of nodes as the starting points for a new path and repeat the process. After k iterations you have two v...

5y3

I'm having trouble understanding why we can't just fix in your proof. Then at each iteration we bisect the interval, so we wouldn't be using the "full power" of the 1-D Sperner's lemma (we would just be using something close to the base case).

Also if we are only given that is continuous, does it make sense to talk about the gradient?

Could you elaborate on what it would mean to demonstrate 'savannah-to-boardroom' transfer? Our architecture was selected for in the wilds of nature, not our training data. To me it seems that when we use an architecture designed for language translation for understanding images we've demonstrated a similar degree of transfer.

I agree that we're not yet there on sample efficient learning in new domains (which I think is more what you're pointing at) but I'd like to be clearer on what benchmarks would show this. For example, how well GPT-4 can integrate a new domain of knowledge from (potentially multiple epochs of training on) a single textbook seems a much better test and something that I genuinely don't know the answer to.