The Research Engineer job for the Alignment team is no longer open - is this because it's reached some threshold of applications? In any case, it might not be helpful to advertise it!
Thanks for doing this though, the context is very useful (I've applied as RE to both).
The synthesis of these options would be an AGI research group whose plan consists of:
What do you think of such a plan?
I think this would be reasonable, but if the plan is taken up then it becomes a cost-benefit analysis of when Idea B should be deployed, which plausibly could be very aggressive, so it could easily boil down to just Idea B.
It's also worth noting that a research group with an AGI who want world governments to perform a pivotal act would need to be incredibly effective and persuasive. Their options would run a spectrum from normal public-channel and lobbying efforts to AGI-takes-over-the-world-behind-the-scenes (depending on sufficient capability), with a variety of AGI-assisted persuasion techniques in between. At some degree of AI/research group control over government, it's not clear if this would be an improvement over the original act. Demonstrating the power of AGI in a way that would force governments to listen would need to at least threaten a transformative act (self-driving cars, solving protein folding, passing normal Turing tests clearly aren't enough) and so the necessary levels of influence and demonstrated capability would be large (and demonstrating capability has obvious potential drawbacks in sparking arms races).
I mentioned it in my standalone post but I'll register a question here:
In the counterexamples for 'Strategy: train a reporter that is useful for another AI', the main difficulty is the ability for agents to hide information in human language somehow, given the many available degrees of freedom.
I grant that this is a big risk but one advantage we have is that if we trained multiple agents, they would all be encoding hidden information, but most likely they would all encode this extra information in different ways.
The question is, given multiple agents encoding hidden messages in the reporter's answers, can we find a training strategy which penalizes or removes any kind of information in the answers which is not common to all agents?
My suggestion, based on my idea of internal note-taking rather than reporting per se, would be to take the 'notes' on the internal state written by one agent and feed these to another agent. This seems like it would eventually converge on them sharing the same hidden encoding though, albeit with more difficulty. Apologies if this is not clear; it is spelled out more in the full post. But yeah, are there better proposals, or reasons why this kind of strategy wouldn't work?
I wonder if the discussion of the scientific capabilities of e.g. GPT-3 would be more productive if it were anchored to some model of the wider scientific feedback loop in which it's situated?
Consider three scenarios:
As we go from A to C we see a decrease in the quality of the feedback loop, and with it an increasing need for general, rather than narrow, intelligence. I would argue that even A should count as doing science, since it advances the state-of-the-art knowledge about an important phenomenon, and current models are clearly capable of doing so. C is clearly well beyond the capabilities of GPT-3 and also many well-qualified, intelligent scientists, because the feedback loop is so poor. B is intermediate and I expect beyond GPT-3, but I'm not confident that current techniques couldn't provide value.
If you're interested in taking the point further, perhaps one of you could specify the loosest scientific feedback loop that you think the current paradigm of AI is capable of meaningfully participating in?
Turns out our methods are not actually very path-dependent in practice!
Yeah I get that's what Mingard et al are trying to show but the meaning of their empirical results isn't clear to me - but I'll try and properly read the actual paper rather than the blog post before saying any more in that direction.
"Flat minimum surrounded by areas of relatively good performance" is synonymous with compression. if we can vary the parameters in lots of ways without losing much performance, that implies that all the info needed for optimal performance has been compressed into whatever-we-can't-vary-without-losing-performance.
I get that a truly flat area is synonymous with compression - but I think being surrounded by areas of good performance is anti-correlated with compression because it indicates redundancy and less-than-maximal sensitivity.
I agree that viewing it as flat eigendimensions in parameter space is the right way to think about it, but I still worry that the same concerns apply: maximal compression in this space is traded off against ease of finding, since such a minimum would be a flat plain in many dimensions but a maximally steep ravine in all of the other directions. I can imagine this could be investigated with some small experiments, or they may well already exist, but I can't promise I'll follow up. If anyone is interested, let me know.
Cheers for posting! I've got a question about the claim that optimizers compress by default, due to the entropy maximization-style argument given around 20:00 (apologies if you covered this, it's not easy to check back through a video):
Let's say that we have a neural network of width 100, trained on a dataset that a network of width only 30 could fit to perfect accuracy. If it compresses the data into only 30 weights, there's a 70-dimensional space of free parameters, and we should expect a randomly selected solution to be of this kind.
I agree that if we randomly sample zero-loss weight configurations, we end up with this kind of compression, but it seems that any kind of learning we know how to do is dependent on the paths that one can take to reach it, and that abstracting this away can give very different results to any high-dimensional optimization that we actually know how to do.
Assuming that the network is parameterized by, say, float16s, maximal compression of the data would result in the output of the network being sensitive to the final bit of the weights in as many cases as possible, thereby leaving the largest number of free bits, so 16 bits of info would be compressed into one weight, rather than spread among 3-4.
My intuition is that these highly compressed arrangements would be very sensitive to perturbations, which would render them incredibly difficult to reach in practice (they would also have a big problem with unknown examples, and are therefore screened off by techniques like dropout and regularization). There is therefore a competing incentive towards minima which are easy to land on - probably flat minima surrounded by areas of relatively good performance. Further, I expect that these kinds of minima tend to leverage the whole network for redundancy and flatness (not needing to depend tightly on the final bit of weights).
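As a rough illustration of the redundancy intuition above, here is a toy sketch in Python. It is not a neural network and does not model the float16 bit-level argument: it simply assumes (my assumption) that a target value of 1.0 is stored either in a single weight or as the mean of several weights, and that each weight receives independent Gaussian noise. The function name `output_spread` and all parameter values are made up for the example.

```python
import random
import statistics

def output_spread(n_weights, noise_std, trials=2000, seed=0):
    """Std of the output when the value 1.0 is stored as the mean of
    n_weights weights, each perturbed by independent Gaussian noise."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(trials):
        weights = [1.0 + rng.gauss(0.0, noise_std) for _ in range(n_weights)]
        outputs.append(sum(weights) / n_weights)
    return statistics.pstdev(outputs)

# All information in one weight vs. spread redundantly over 16 weights.
concentrated = output_spread(n_weights=1, noise_std=0.01)
redundant = output_spread(n_weights=16, noise_std=0.01)
print("one weight:", concentrated)
print("16 weights:", redundant)
```

Under these assumptions the output noise shrinks roughly as one over the square root of the number of weights, so the redundant arrangement sits in a flatter region but uses sixteen parameters to hold what one could hold.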
The properties of the resulting minima would be not just compression but some combination of compression and smoothness (smoothness being sort of a variant of compression where the final bits don't matter much), which would not result in some subset of the parameters having all the useful information.
If you agree that this is what happens, in what sense is there really compression, if the info is spread among multiple bits? Perhaps given the structure of NNs, we should expect to be able to compress by removing the last bits of weights as these are the easiest to leave free given the structure of training?
If you disagree I'd be curious to know where. I sense that Mingard et al. share your conclusion but I don't yet understand the claimed empirical demonstration.
tldr: optimization may compress by default, but learning seems to counteract this by choosing easy-to-find minima.
Yeah fully agreed.
I see John agrees with the 'one-time' label but it seems a bit too strong to me, especially if the kind of optimization is 'let's try a totally different approach', rather than continuing to train the same system, or focusing on exactly why it spoofed one sensor but not the other. Just to think it through:
There are three types of system that are important: type A which fails on the validation/holdout data, type B which succeeds on validation but not test/real-world data, and type C, which succeeds on both. We are looking for type C, and we use the validation data to distinguish A from either B or C.
Naively, waiting longer for a system that is not-A wouldn't have a bearing on whether it is B or C. But upon finding A, we know the search is finding the strategy of spoofing sensors, and the more times we find A, the more we suspect this strategy is dominant, which suggests that partial spoofing (B) is more likely than no spoofing (C). Therefore, when we find not-A after a series of As, it is more likely to be B than if we found not-A on our first try.
I agree with the logic but it seems like our expectation of the B:C ratio will increase smoothly over time. If the holdout sensors are different to non-holdout ones, costly to spoof, and any leakage is minimized (maximizing initial expectations of the C:B ratio), then finding not-A seems to be meaningful evidence in favor of C for a while.
Not to say that this solves ELK, but it seems like it should remain (ever weaker) evidence in favor of honesty for multiple iterations, though I can't say I know how steep the fall-off should be.
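To make the A/B/C reasoning above concrete, here is a minimal Bayesian sketch. Everything numeric in it is an assumption for illustration: I posit two hypotheses about the search process ("spoofing is the dominant strategy" and "spoofing is rare"), each with a made-up probability of producing a type A, B or C system, and a 50/50 prior between them. The function name `prob_b_given_not_a` is also hypothetical.

```python
def prob_b_given_not_a(num_prior_as, prior_spoof=0.5):
    """P(system is type B | it is not-A), after first observing
    num_prior_as type-A systems. All probabilities are illustrative."""
    # Assumed chance of each system type under the two hypotheses.
    spoof_dominant = {"A": 0.80, "B": 0.15, "C": 0.05}
    spoof_rare = {"A": 0.20, "B": 0.10, "C": 0.70}

    # Unnormalized posterior weight on each hypothesis after the As.
    w_dom = prior_spoof * spoof_dominant["A"] ** num_prior_as
    w_rare = (1 - prior_spoof) * spoof_rare["A"] ** num_prior_as

    # Probability that the next system is B, and that it is not-A.
    p_b = w_dom * spoof_dominant["B"] + w_rare * spoof_rare["B"]
    p_not_a = w_dom * (1 - spoof_dominant["A"]) + w_rare * (1 - spoof_rare["A"])
    return p_b / p_not_a

for k in range(6):
    print(k, "prior As -> P(B | not-A) =", round(prob_b_given_not_a(k), 3))
```

With these made-up numbers, the probability that a not-A system is B starts at 0.25 with no prior As and rises smoothly towards 0.75, rather than jumping after the first failure. How quickly it rises depends entirely on the assumed distributions, which is the part I can't say much about.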
This could also be extended by having multiple levels of holdout data, the next level being only evaluated once we have sufficient confidence that it is honest (accounting for the declining level of evidence given by previous levels, with the assumption that there are other means of testing).
I was going to write something similar, and just wanted to add that this problem can be expected to get worse the more non-holdout sensors there are. If there were just a single non-holdout camera then spoofing only the one camera would be worthwhile - but if there were a grid of cameras with just a few being held out then it would likely be easiest to take an action that fools them all, like a counterfeit diamond.
This method would work best when there are whole modes of data which are ignored, and the work needed to spoof them is orthogonal to all non-holdout modes.
I've been looking at papers involving a lot of 'controlling for confounders' recently and am unsure about how much weight to give their results.
Does anyone have recommendations about how to judge the robustness of these kinds of studies?
Also, I was considering doing some tests of my own based on random causal graphs, testing what happens to regressions when you control for a limited subset of confounders, varying the size/depth of graph and so on. I can't seem to find any similar papers but I don't know the area. Does anyone know of similar work?
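A minimal sketch of the kind of test I have in mind, in a much simpler setup than random graphs: I assume a fixed linear model in which every confounder directly affects both the treatment x and the outcome y with unit weight, all noise is standard Gaussian, and the true effect of x on y is 1.0. The helper names `ols` and `estimate_effect` are made up, and the regression is done with plain-Python normal equations to keep the example self-contained.

```python
import random

def ols(columns, y):
    """Least-squares coefficients for y ~ columns (no intercept),
    solving the normal equations by Gauss-Jordan elimination."""
    k = len(columns)
    # Augmented matrix [X^T X | X^T y].
    a = [[sum(p * q for p, q in zip(columns[i], columns[j])) for j in range(k)]
         + [sum(p * q for p, q in zip(columns[i], y))] for i in range(k)]
    for col in range(k):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

def estimate_effect(num_confounders, num_controlled, n=5000,
                    true_effect=1.0, seed=0):
    """Estimated effect of x on y when only the first num_controlled
    confounders are included as regressors."""
    rng = random.Random(seed)
    zs = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(num_confounders)]
    x = [sum(z[i] for z in zs) + rng.gauss(0, 1) for i in range(n)]
    y = [true_effect * x[i] + sum(z[i] for z in zs) + rng.gauss(0, 1)
         for i in range(n)]
    return ols([x] + zs[:num_controlled], y)[0]

for controlled in range(5):
    print(controlled, "of 4 controlled ->", round(estimate_effect(4, controlled), 3))
```

In this particular model, with k confounders of which m are controlled, the estimate should converge to 1 + (k - m) / (k - m + 1), so controlling for half of four confounders only moves the expected estimate from 1.8 to about 1.67. Random graphs with varying depth would of course behave differently, which is what I'd want to explore.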