Yeah I agree demonstrating it and examining it is very important, especially in chain-of-thought where the existence of hidden information isn't a certainty.

They're a bit out of date, but the original proposal is ELK Sub - Note-taking in internal rollouts, and the multiple-model protocol is in Note Taking Without Hidden Messages.

Haven't done any proper testing but it's high up on my agenda, would be interested in working out what tests would be best - will elaborate later.

For the ELK contest I was working on something similar, but where the problem is more acute (you're trying to force a MuZero-style reinforcement learner to compress its hidden state into words).

The methods I proposed to deal with this are:

  • Cut down the null space by training a language model to rewrite the written words in a different format which preserves the meaning as closely as possible, so that the CoT reasoner is forced to respond to similar sentences in similar ways.
  • Use multiple models which are trained separately but have to respond to each other's thoughts, making it more difficult to have a shared steganographic protocol.
  • Use something like dropout on the chains of thought, so that it can't hide information in the combination of multiple sentences.
  • Have some non-linguistic bits which we don't apply optimization to, but instead focus our interpretability efforts on. The hope is that this soaks up the highest-order bits that aren't easily expressible in language, so there's less need to hide them within words. Applying dropout to these is particularly necessary.
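Of these, dropout on the chain of thought is simple enough to sketch. A minimal toy version (the function name and the crude sentence-splitting heuristic are mine, not from any existing implementation):

```python
import random

def cot_dropout(chain_of_thought, p_drop=0.3, rng=None):
    """Randomly drop whole sentences from a chain of thought, so a model
    can't reliably hide information in the *combination* of sentences.
    Toy sketch: splits on '.', which is far cruder than a real tokenizer."""
    rng = rng or random.Random(0)
    sentences = [s.strip() for s in chain_of_thought.split(".") if s.strip()]
    kept = [s for s in sentences if rng.random() > p_drop]
    if not kept:  # never hand downstream reasoning an empty chain
        kept = [sentences[-1]]
    return ". ".join(kept) + "."

cot = "The box is red. The key is under the mat. The door is locked. Use the key."
print(cot_dropout(cot, p_drop=0.5))
```

During training, the downstream reasoner would only ever see the post-dropout chain, so any encoding spread across several sentences becomes unreliable.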

I'm really interested to see this progress, it would feel very healthy if we could have a solid integrated definition of optimizer to work with.

I'm not sure I understand why you don't agree with the 'small' criterion for the target set. It seems that you should be able to say something about the likelihood of the target in the absence of any agent (or if the agent takes a max-ent distribution over actions or something), and that's the relevant notion of smallness, which then becomes large in the presence of the agent. Or is it that you expect it to be difficult to properly specify what it means to have no agent or random decisions?

On the relationships between the three ways of defining acts - is there a trivial way of connecting (1) and (3) by saying that the action that the agent takes in (1) is just baked into the trajectory as some true fact about the trajectory that doesn't have consequences until the agent acts on it? Or instead of the action itself, we could 'bake in' the mapping from some information about the trajectory to the action. Either way we could see this as being determined initially or at the point of decision without a difference in the resulting trajectories.

'all dependent variables in the system of equations' - I think this should be 'independent'.

Suggestion:

Eliezer has huge respect in the community; he has strong, well thought-out opinions (often negative) on a lot of the safety research being done (with exceptions, Chris Olah mentioned a few times); but he's not able to work full time on research directly (or so I understand, could be way off).

Perhaps he should institute some kind of prize for work done, trying to give extra prestige and funding to work going in his preferred direction? Does this exist in some form without my noticing? Is there a reason it'd be bad? Time/energy usage for Eliezer combined with difficulty of delegation?

The Research Engineer job for the Alignment team is no longer open - is this because it's reached some threshold of applications? In any case might not be helpful to advertise!

Thanks for doing this though, the context is very useful (I've applied as RE to both).

The synthesis of these options would be an AGI research group whose plan consists of:

  • Develop safe AGI.
  • Try to convince world governments to perform some such pivotal act (Idea A) - note that per current institutions this needs consensus and strong implementation across all major and medium tech powers.
  • Have a back-up plan, if AGI research is proliferating without impending shutdown, to shut down world research unilaterally (Idea B).

What do you think of such a plan?

I think this would be reasonable, but if the plan is taken up then it becomes a cost-benefit analysis of when Idea B should be deployed, which plausibly could be very aggressive, so it could easily boil down to just Idea B. 

It's also worth noting that a research group with an AGI who want world governments to perform a pivotal act would need to be incredibly effective and persuasive. Their options would run a spectrum from normal public-channel and lobbying efforts to AGI-takes-over-the-world-behind-the-scenes (depending on sufficient capability), with a variety of AGI-assisted persuasion techniques in between. At some degree of AI/research group control over government, it's not clear if this would be an improvement over the original act. Demonstrating the power of AGI in a way that would force governments to listen would need to at least threaten a transformative act (self-driving cars, solving protein folding, passing normal Turing tests clearly aren't enough) and so the necessary levels of influence and demonstrated capability would be large (and demonstrating capability has obvious potential drawbacks in sparking arms races).

I mentioned it in my standalone post but I'll register a question here:

In the counterexamples for 'Strategy: train a reporter that is useful for another AI', the main difficulty is the ability of agents to hide information in human language somehow, given the many available degrees of freedom.

I grant that this is a big risk but one advantage we have is that if we trained multiple agents, they would all be encoding hidden information, but most likely they would all encode this extra information in different ways.

The question is, given multiple agents encoding hidden messages in the reporter's answers, can we find a training strategy which penalizes or removes any kind of information in the answers which is not common to all agents?

My suggestion, based on my idea of internal note-taking rather than reporting per se, would be to take the 'notes' on the internal state written by one agent and feed them to another agent. This seems like it would eventually converge on the agents sharing the same hidden encoding, though, albeit with more difficulty. Apologies if this isn't clear - it's spelled out more fully in the full post - but yeah, are there better proposals, or reasons why this kind of strategy wouldn't work?
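As a toy illustration of the 'keep only what's common to all agents' idea (this is my own sketch, operating on surface tokens rather than anything semantic, which a real method would need):

```python
def common_content(answers):
    """Toy filter: keep only the words that appear in every agent's answer.
    Hidden encodings that differ across independently trained agents get
    stripped out; shared, load-bearing content survives."""
    token_sets = [set(a.lower().split()) for a in answers]
    shared = set.intersection(*token_sets)
    # Preserve the word order of the first answer for readability.
    return " ".join(w for w in answers[0].lower().split() if w in shared)

agents = [
    "the diamond is in the vault zq7",  # agent 1's hypothetical hidden marker
    "the diamond is in the vault kx2",  # agent 2's
    "the diamond is in the vault m9p",  # agent 3's
]
print(common_content(agents))  # the per-agent markers disappear
```

The hope is that a training penalty with this shape makes any single agent's steganographic channel useless, since it isn't shared.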

I wonder if the discussion of the scientific capabilities of e.g. GPT-3 would be more productive if it were anchored to some model of the wider scientific feedback loop in which it's situated?

Consider three scenarios:

  • A: A model trained to predict the shapes of proteins from their DNA sequences.
  • B: A model with access to custom molecule synthesizer and high-volume, high-quality feedback about the scientific value of its text production, trained to write papers with scientific value about the nature of the molecules produced.
  • C: A model given the resources of a psychology department and trained to advance the state of psychological understanding (without just hiring other humans).

As we go from A to C we see a decrease in the quality of the feedback loop, and with it an increasing need for general, rather than narrow, intelligence. I would argue that even A should count as doing science, since it advances state-of-the-art knowledge about an important phenomenon, and current models are clearly capable of doing so. C is clearly well beyond the capabilities of GPT-3, and also of many well-qualified, intelligent scientists, because the feedback loop is so poor. B is intermediate: I expect it's beyond GPT-3, but I'm not confident that current techniques couldn't provide value.

If you're interested in taking the point further, perhaps one of you could specify the loosest scientific feedback loop that you think the current paradigm of AI is capable of meaningfully participating in?

Turns out our methods are not actually very path-dependent in practice!

Yeah I get that's what Mingard et al are trying to show but the meaning of their empirical results isn't clear to me - but I'll try and properly read the actual paper rather than the blog post before saying any more in that direction.

"Flat minimum surrounded by areas of relatively good performance" is synonymous with compression. If we can vary the parameters in lots of ways without losing much performance, that implies that all the info needed for optimal performance has been compressed into whatever-we-can't-vary-without-losing-performance.

I get that a truly flat area is synonymous with compression - but I think being surrounded by areas of good performance is anti-correlated with compression because it indicates redundancy and less-than-maximal sensitivity. 

I agree that viewing it as flat eigendimensions in parameter space is the right way to think about it. I still worry that the same concern applies: maximal compression in this space is traded against ease of finding the minimum, since a maximally compressed solution would be a flat plain in many dimensions but a maximally steep ravine in all of the other directions. I can imagine this being investigated with some small experiments (which may well already exist). I can't promise I'll follow up, but if anyone is interested, let me know.

Cheers for posting! I've got a question about the claim that optimizers compress by default, due to the entropy maximization-style argument given around 20:00 (apologies if you covered this, it's not easy to check back through a video):

Let's say that we have a neural network of width 100, trained on a dataset which could be fit to perfect accuracy by a network of width only 30. If the solution is compressed into only 30 weights, there's a 70-dimensional space of free parameters, and we should expect a randomly selected solution to be of this kind.

I agree that if we randomly sample zero-loss weight configurations, we end up with this kind of compression, but it seems that any kind of learning we know how to do is dependent on the paths that one can take to reach it, and that abstracting this away can give very different results to any high-dimensional optimization that we actually know how to do. 

Assuming that the network is parameterized by, say, float16s, maximal compression of the data would result in the output of the network being sensitive to the final bit of the weights in as many cases as possible, thereby leaving the largest number of free bits: 16 bits of info would be compressed into one weight, rather than spread among 3-4.

My intuition is that these highly compressed arrangements would be very sensitive to perturbations, which renders them incredibly difficult to reach in practice (they would also have a big problem with unseen examples, and are therefore screened off by techniques like dropout and regularization). There is therefore a competing incentive towards minima which are easy to land on - probably flat minima surrounded by areas of relatively good performance. Further, I expect that these kinds of minima tend to leverage the whole network for redundancy and flatness (not needing to depend tightly on the final bits of the weights).
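The sensitivity claim is easy to check numerically on a toy 1-D 'loss landscape' (the curvatures below are arbitrary stand-ins, not measurements from any real network):

```python
import random

# Two zero-loss minima that differ only in curvature: a sharp one, standing in
# for a maximally compressed solution, and a flat one, standing in for a
# redundant solution spread across the whole network.
def sharp_loss(w):
    return 100.0 * (w - 1.0) ** 2

def flat_loss(w):
    return 0.01 * (w - 2.0) ** 2

rng = random.Random(0)
perturbs = [rng.gauss(0.0, 0.1) for _ in range(10_000)]  # small weight noise

# Average loss after perturbing each minimum by the same noise samples.
sharp_rise = sum(sharp_loss(1.0 + e) for e in perturbs) / len(perturbs)
flat_rise = sum(flat_loss(2.0 + e) for e in perturbs) / len(perturbs)

print(f"sharp minimum: avg loss rise {sharp_rise:.4f}")
print(f"flat minimum:  avg loss rise {flat_rise:.6f}")
```

The ratio of the two rises is just the ratio of curvatures (10,000x here), which is the sense in which a sharp minimum is fragile under the kind of noise that dropout, regularization, and SGD itself inject.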

The resulting properties would be not just compression but some combination of compression and smoothness (smoothness being sort of a variant of compression where the final bits don't matter much), which would not result in some subset of the parameters holding all the useful information.

If you agree that this is what happens, in what sense is there really compression, if the info is spread among multiple bits? Perhaps given the structure of NNs, we should expect to be able to compress by removing the last bits of weights as these are the easiest to leave free given the structure of training?

If you disagree I'd be curious to know where. I sense that Mingard et al shares your conclusion but I don't yet understand the claimed empirical demonstration.

tldr: optimization may compress by default, but learning seems to counteract this by choosing easy-to-find minima.
