All of Hoagy's Comments + Replies

Steganography in Chain of Thought Reasoning

Yeah I agree demonstrating it and examining it is very important, especially in chain-of-thought where the existence of hidden information isn't a certainty.

They're a bit out of date, but the original proposal is ELK Sub - Note-taking in internal rollouts, and the multiple-model protocol is in Note Taking Without Hidden Messages.

Haven't done any proper testing but it's high up on my agenda, would be interested in working out what tests would be best - will elaborate later.

Steganography in Chain of Thought Reasoning

For the ELK contest I was working on something similar, but where the problem is more acute (you're trying to force a MuZero-style reinforcement learner to compress its hidden state into words).

The methods I proposed to deal with this are:

  • Cut down the null space by training a language model to rewrite the written words in a different format which preserves the meaning as closely as possible, so that the CoT reasoner is forced to respond to similar sentences in similar ways (a toy version is sketched after this list).
  • Use multiple models which are trained separately but have to respond to each others'
... (read more)
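A toy illustration of the first bullet (my sketch, not from the original proposal): a "reasoner" hides one bit per word in capitalization, and a trivial case-normalizer stands in for the meaning-preserving rewriting LM.

```python
# Toy demonstration of the paraphrase defense: hidden information is
# carried in surface form (capitalization), so a meaning-preserving
# rewrite destroys the channel while leaving the content intact.

def encode_hidden_bits(words, bits):
    """Hide one bit in the case of each word's first letter."""
    return " ".join(w.capitalize() if b else w.lower()
                    for w, b in zip(words, bits))

def decode_hidden_bits(text):
    return [int(w[0].isupper()) for w in text.split()]

def paraphrase(text):
    """Stand-in for an LM that rewrites text, preserving meaning only."""
    return " ".join(text.lower().split())

words = "the answer follows from the previous step".split()
secret = [1, 0, 1, 1, 0, 1, 0]
cot = encode_hidden_bits(words, secret)
print(decode_hidden_bits(cot))              # [1, 0, 1, 1, 0, 1, 0]
print(decode_hidden_bits(paraphrase(cot)))  # [0, 0, 0, 0, 0, 0, 0]
```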
2Alex Gray10d
Putting what I think is the most important part of my reply first: I think research into mitigations is premature and instead demonstrating/measuring the phenomena should take priority. That said, I agree that these are all possible mitigations, in particular (rephrasing your points):

  • Reducing the channel capacity [https://en.wikipedia.org/wiki/Channel_capacity], and thus the amount of information that can be transmitted
  • Constraining the outputs, and thus reducing the degrees of freedom of the null space
  • Adding noise (via dropout or otherwise), which also reduces the channel capacity, and probably has some desirable side effects too (like increasing the diversity or creativity of reasoning)
  • Introducing "barriers" past which optimization for hidden information becomes difficult (in your example, you use different models; I also think "stop gradients" in end-to-end training have similar, if weaker, effects)
  • If you can detect/localize this phenomenon, you might be able to use interpretability-like techniques to limit or constrain it - I'm less certain about this, but possibly future interpretability research will unlock this ability.

Also, I think an important point here is that we can expect to see it with fairly simple techniques - merely including sampled/generated data in the training set is sufficient, as opposed to requiring a complicated reinforcement learning algorithm like MuZero.

Did you publish your proposal? I'd be interested in reading it.
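As a quick illustration of the channel-capacity framing above (my numbers: model the hidden channel as a binary symmetric channel where noise, e.g. dropout or paraphrasing, flips each hidden bit with probability p):

```python
# Capacity of a binary symmetric channel, C(p) = 1 - H2(p): the maximum
# hidden information (in bits) transmissible per bit of surface freedom.
from math import log2

def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def bsc_capacity(p):
    return 1.0 - binary_entropy(p)

for p in (0.0, 0.05, 0.2, 0.5):
    print(f"flip prob {p:.2f} -> capacity {bsc_capacity(p):.3f} bits")
# Maximal noise (p = 0.5) drives the hidden channel's capacity to zero.
```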
Bridging Expected Utility Maximization and Optimization

I'm really interested to see this progress, it would feel very healthy if we could have a solid integrated definition of optimizer to work with.

I'm not sure I understand why you don't agree with the 'small' criterion for the target set. It seems that you should be able to say something about the likelihood of the target in the absence of any agent (or if the agent takes a max-ent distribution over actions or something), and that's the relevant notion of smallness, which then becomes large in the presence of the agent. Or is it that you expect it to be diff... (read more)

Hoagy's Shortform

Suggestion:

Eliezer has huge respect in the community; he has strong, well thought-out opinions (often negative) on a lot of the safety research being done (with exceptions, Chris Olah mentioned a few times); but he's not able to work full time on research directly (or so I understand, could be way off).

Perhaps he should institute some kind of prize for work done, trying to give extra prestige and funding to work going in his preferred direction? Does this exist in some form without my noticing? Is there a reason it'd be bad? Time/energy usage for Eliezer combined with difficulty of delegation?

DeepMind is hiring for the Scalable Alignment and Alignment Teams

The Research Engineer job for the Alignment team is no longer open - is this because it's reached some threshold of applications? In any case might not be helpful to advertise!

Thanks for doing this though, the context is very useful (I've applied as RE to both).

2Rohin Shah3mo
Should be fixed now!
1[comment deleted]3mo
“Pivotal Act” Intentions: Negative Consequences and Fallacious Arguments

The synthesis of these options would be an AGI research group whose plan consists of:

  • Develop safe AGI.
  • Try to convince world governments to perform some such pivotal act (Idea A) - note that per current institutions this needs consensus and strong implementation across all major and medium tech powers.
  • Have a back-up plan, if AGI research is proliferating without impending shutdown, to shut down world research unilaterally (Idea B).

What do you think of such a plan?

I think this would be reasonable, but if the plan is taken up then it becomes a cost-benefit an... (read more)

ELK prize results

I mentioned it in my standalone post but I'll register a question here:

In the counterexamples for 'Strategy: train a reporter that is useful for another AI', the main difficulty is the ability for agents to hide information in human language somehow, given the many available degrees of freedom.

I grant that this is a big risk but one advantage we have is that if we trained multiple agents, they would all be encoding hidden information, but most likely they would all encode this extra information in different ways.

The question is, given multiple agents encod... (read more)

Ngo and Yudkowsky on scientific reasoning and pivotal acts

I wonder if the discussion of the scientific capabilities of e.g. GPT-3 would be more productive if it were anchored to some model of the wider scientific feedback loop in which it's situated?

Consider three scenarios:

  • A: A model trained to predict the shapes of proteins from their DNA sequences.
  • B: A model with access to a custom molecule synthesizer and high-volume, high-quality feedback about the scientific value of its text production, trained to write papers with scientific value about the nature of the molecules produced.
  • C: A model given the resources of
... (read more)
The Big Picture Of Alignment (Talk Part 1)

Turns out our methods are not actually very path-dependent in practice!

Yeah, I get that's what Mingard et al are trying to show, but the meaning of their empirical results isn't clear to me - I'll try to properly read the actual paper rather than the blog post before saying any more in that direction.

"Flat minimum surrounded by areas of relatively good performance" is synonymous with compression. if we can vary the parameters in lots of ways without losing much performance, that implies that all the info needed for optimal performance has been compresse

... (read more)
The Big Picture Of Alignment (Talk Part 1)

Cheers for posting! I've got a question about the claim that optimizers compress by default, due to the entropy maximization-style argument given around 20:00 (apologies if you covered this, it's not easy to check back through a video):

Let's say that we have a neural network of width 100, trained on a dataset that a network of width only 30 could fit to perfect accuracy. If the solution is compressed into only 30 weights, there's a 70-dimensional space of free parameters, and we should expect a randomly selected solution to be of this kind. ... (read more)

3johnswentworth6mo
This is where Mingard et al [https://towardsdatascience.com/neural-networks-are-fundamentally-bayesian-bee9a172fad8] come in. One of their main results is that SGD training on neural nets quite well approximates just-randomly-sampling-an-optimal-point. Turns out our methods are not actually very path-dependent in practice!

There is a mismatch between your intuition and the implications of "flat minima surrounded by areas of relatively good performance". Remember, the whole point of the "highly compressed arrangements" is that we only need to lock in a few parameter values in order to get optimal behavior; once those few values are locked in, the rest of the parameters can mostly vary however they want without screwing stuff up. "Flat minimum surrounded by areas of relatively good performance" is synonymous with compression: if we can vary the parameters in lots of ways without losing much performance, that implies that all the info needed for optimal performance has been compressed into whatever-we-can't-vary-without-losing-performance.

Now, your intuition is correct in the sense that info may be spread over many parameters; the relevant "ways to vary things" may not just be "adjust one param while holding others constant". For instance, it might be more useful to look at parameter variation along local eigendirections of the Hessian. Then the claim would be something like "flat optimum = performance is flat along lots of eigendirections, therefore we can project the parameter-values onto the non-flat eigendirections and those projections are the 'compressed info'". (Tbc, I still don't know what the best way is to characterize this sort of thing, but eigendirections are an obvious approximation which will probably work.)
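A toy check of the flat-directions picture (my sketch, not Mingard et al's method): confine the data to a low-dimensional subspace so that a linear model's Hessian at an optimum has exactly that many non-flat eigendirections.

```python
# Count "flat" Hessian eigendirections at an optimum. Inputs live in a
# 3-dim subspace of R^10, so a 10-parameter linear model should have
# 7 free directions; the "compressed info" is the other 3 projections.
import torch

torch.manual_seed(0)
X = torch.randn(100, 3) @ torch.randn(3, 10)   # rank-3 inputs
w_opt = torch.randn(10)
y = X @ w_opt                                  # w_opt is an exact optimum

def loss_fn(w):
    return ((X @ w - y) ** 2).mean()

H = torch.autograd.functional.hessian(loss_fn, w_opt)   # 10 x 10
eigvals = torch.linalg.eigvalsh(H)
n_flat = int((eigvals.abs() < 1e-4).sum())
print(f"{n_flat} flat eigendirections out of {eigvals.numel()}")  # expect 7
```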
Some Hacky ELK Ideas

I see John agrees with the 'one-time' label but it seems a bit too strong to me, especially if the kind of optimization is 'let's try a totally different approach', rather than continuing to train the same system, or focusing on exactly why it spoofed one sensor but not the other. Just to think it through:

There are three types of system that are important: type A which fails on the validation/holdout data, type B which succeeds on validation but not test/real-world data, and type C, which succeeds on both. We are looking for type C, and we use the validatio... (read more)

5Abram Demski6mo
I agree. There's nothing magical about "once". I almost wrote "once or twice", but it didn't sit well with the level of caution I would prefer be the norm. While your analysis seems correct, I am worried if that's the plan.

I think a safety team should go into things with the attitude that this type of thing is important as a last line of defense, but should never trigger. The plan should involve a strong argument that what's being built is safe. In fact, if this type of safeguard gets triggered, I would want the policy to be to go back to the drawing board, take the new information into account, and come up with a more well-argued plan. The new plan can have "never to be used" safeguards like this, but hopefully it has more and different ones this time.

If, on the other hand, a safety team goes in with the idea that John's safeguard can be iterated a few times as you argue, then I anticipate them fooling themselves by iterating too many times and coming up with a plan that accidentally skirts the safeguard in some hard-to-notice way. (I have no reason to expect these sorts of anticipations to be calibrated; I'm just thinking cautiously here.)
Some Hacky ELK Ideas

I was going to write something similar, and just wanted to add that this problem can be expected to get worse the more non-holdout sensors there are. If there were just a single non-holdout camera then spoofing only the one camera would be worthwhile - but if there were a grid of cameras with just a few being held out then it would likely be easiest to take an action that fools them all, like a counterfeit diamond. 

This method would work best when there are whole modes of data which are ignored, and the work needed to spoof them is orthogonal to the work needed for the non-holdout modes.

Hoagy's Shortform

I've been looking at papers involving a lot of 'controlling for confounders' recently and am unsure about how much weight to give their results.

Does anyone have recommendations about how to judge the robustness of these kind of studies?

Also, I was considering doing some tests of my own based on random causal graphs, testing what happens to regressions when you control for a limited subset of confounders, varying the size/depth of graph and so on. I can't seem to find any similar papers but I don't know the area, does anyone know of similar work?
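For what it's worth, the simplest version of the test is only a few lines (a sketch under linear-Gaussian assumptions; the interesting part would be varying graph size/depth, nonlinearity, and which confounders are observed):

```python
# Linear causal model with k confounders affecting both treatment and
# outcome; regress outcome on treatment controlling for only the first
# m confounders and watch the bias in the estimated effect shrink.
import numpy as np

rng = np.random.default_rng(0)
n, k = 100_000, 5
confounders = rng.normal(size=(n, k))
treatment = confounders.sum(axis=1) + rng.normal(size=n)
outcome = 2.0 * treatment + confounders.sum(axis=1) + rng.normal(size=n)
# True causal effect of treatment on outcome is 2.0.

for m in range(k + 1):
    Z = np.column_stack([treatment, confounders[:, :m], np.ones(n)])
    coef, *_ = np.linalg.lstsq(Z, outcome, rcond=None)
    print(f"controlling for {m}/{k} confounders: estimate = {coef[0]:.3f}")
```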

Rogue AGI Embodies Valuable Intellectual Property

This employee has 100 million dollars, approximately 10,000x fewer resources than the hedge fund. Even if the employee engaged in unethical business practices to achieve a 2x higher yearly growth rate than their former employer, it would take 13 years for them to have a similar amount of capital.

I think it's worth being explicit here about whether increases in resources under control are due to  appreciation of existing capital or allocation of new capital.

If you're talking about appreciation, then if the firm earns 5% returns on average and the rogue... (read more)
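(For reference, I assume the quoted 13 years comes from reading "2x higher yearly growth rate" as the employee's capital compounding at twice the employer's factor each year, so that the 10,000x gap halves annually:

$$2^{t} = 10{,}000 \implies t = \log_2 10^4 \approx 13.3 \text{ years}$$

On the milder reading where the firm earns 5% and the employee 10%, closing the same gap would instead take $\log(10^4)/\log(1.10/1.05) \approx 200$ years, which is why the distinction matters.)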

Big picture of phasic dopamine

Cheers for the post, I find the whole series fascinating.

One thing I was particularly curious about is how these 'proposals' are made. Do you have a picture of what kind of embedding is used to present a potential action? 

For example, is a proposal encoded in the activations of a set of neurons that is isomorphic to the motor neurons, such that it could propose tightening a set of finger muscles through specific neurons? Or is the embedding jointly learned between the two in some large unstructured connection, or smaller latent space, or something completely different?

3Steve Byrnes1y
The least-complicated case (I think) is: I (tentatively) think that the hippocampus is more-or-less a lookup table with a finite number of discrete thoughts / memories / locations / whatever (the type of content is different in different species), and a "proposal" is just "which of the discrete things should be activated right now".

A medium-difficulty case is: I think motor cortex stores a bunch of sequences of motor commands which execute different common action sequences. (I'm a believer in the Graziano theory [https://pubmed.ncbi.nlm.nih.gov/17964243/] that primary motor cortex, secondary motor cortex, supplementary motor cortex, etc. etc., are all doing the same kind of thing and should be lumped together.) The exact details of the data structures that the brain uses to store these sequences of motor commands are controversial and I don't want to get into it here…

Then the hardest case is the areas that "think thoughts", spawn new ideas, etc., all the cool stuff that leads to human intelligence (e.g. dorsolateral prefrontal cortex, I think). Things like "I'm going to go to the store" or "what if I differentiate both sides of the equation?". Those things are clearly not isomorphic to a sequence of motor commands. It's higher-level than that. Again, the exact data structures and algorithms involved in representing and searching for these "thoughts" is a very big and controversial topic that I don't want to get into here…
Testing The Natural Abstraction Hypothesis: Project Intro

Another little update, speed issue solved for now by adding SymPy's fortran wrappers to the derivative calculations - calculating the SVD isn't (yet?) the bottleneck. Can now quickly get results from 1,000+ step simulations of 100s of particles. 

Unfortunately, even for the pretty stable configuration below, the values are indeed exploding. I need to go back through the program and double check the logic but I don't think it should be chaotic, if anything I would expect the values to hit zero.

It might be that there's some kind of quasi-chaotic behaviou... (read more)

2johnswentworth1y
If the wheels are bouncing off each other, then that could be chaotic in the same way as billiard balls. But at least macroscopically, there's a crapton of damping in that simulation, so I find it more likely that the chaos is microscopic. But also my intuition agrees with yours, this system doesn't seem like it should be chaotic...
Testing The Natural Abstraction Hypothesis: Project Intro

Been a while but I thought the idea was interesting and had a go at implementing it. Houdini was too much for my laptop, let alone my programming skills, but I found a simple particle simulation in pygame which shows the basics, as you can see below.

[Image: exponents of the Jacobian of a 5-particle, 200-step simulation, with groups of 3 and 2 connected by springs]

Planned next step is to work on the run-time speed (even this took a couple of minutes to run; calculating the frame-to-frame Jacobian is a pain, probably more than necessary) and then add some utilities for creatin... (read more)

3johnswentworth1y
Nice! A couple notes:

  • Make sure to check that the values in the Jacobian aren't exploding - i.e. there aren't values like 1e30 or 1e200 or anything like that. Exponentially large values in the Jacobian probably mean the system is chaotic.
  • If you want to avoid explicitly computing the Jacobian, write a method which takes in a (constant) vector $u$ and uses backpropagation to return $\nabla_{x_0}(x_t \cdot u)$. This is the same as the time-0-to-time-$t$ Jacobian dotted with $u$, but it operates on size-n vectors rather than n-by-n Jacobian matrices, so it should be a lot faster. Then just wrap that method in a LinearOperator [https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.LinearOperator.html] (or the equivalent in your favorite numerical library), and you'll be able to pass it directly to an SVD method [https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.svds.html].

In terms of other uses... you could e.g. put some "sensors" and "actuators" in the simulation, then train some controller to control the simulated system, and see whether the data structures learned by the controller correspond to singular vectors of the Jacobian. That could make for an interesting set of experiments, looking at different sensor/actuator setups and different controller architectures/training schemes to see which ones do/don't end up using the singular-value structure of the system.
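A minimal version of the LinearOperator trick (my sketch: PyTorch supplies the Jacobian-vector products, and a toy differentiable map stands in for the particle simulation):

```python
# Expose Jacobian-vector products of the map x0 -> xt to scipy's sparse
# SVD without ever materializing the n-by-n Jacobian.
import numpy as np
import torch
from scipy.sparse.linalg import LinearOperator, svds

n = 50
A = 0.1 * torch.randn(n, n)

def simulate(x0):                 # toy stand-in dynamics, 20 time steps
    x = x0
    for _ in range(20):
        x = torch.tanh(A @ x)
    return x

x0 = torch.randn(n)

def matvec(v):                    # J @ v, via forward-mode autodiff
    v = torch.as_tensor(np.asarray(v, dtype=np.float32).ravel())
    return torch.autograd.functional.jvp(simulate, x0, v)[1].numpy()

def rmatvec(u):                   # J.T @ u, via backpropagation
    u = torch.as_tensor(np.asarray(u, dtype=np.float32).ravel())
    return torch.autograd.functional.vjp(simulate, x0, u)[1].numpy()

J = LinearOperator((n, n), matvec=matvec, rmatvec=rmatvec, dtype=np.float32)
print(svds(J, k=6, return_singular_vectors=False))  # six largest singular values
```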
Testing The Natural Abstraction Hypothesis: Project Intro

Reading this after Steve Byrnes' posts on neuroscience gives a potentially unfortunate view on this.

The general impression is that a lot of our general understanding of the world is carried in the neocortex, which runs a consistent statistical algorithm, and the fact that humans converge on similar abstractions about the world could be explained by the statistical regularities of the world as discovered by this system. At the same time, the other parts of the brain have a huge variety of structures and have functions which are the products of evolu... (read more)

5johnswentworth1y
Here's one fairly-standalone project which I probably won't get to soon. It would be a fair bit of work, but also potentially very impressive in terms of both showing off technical skills and producing cool results.

Short somewhat-oversimplified version: take a finite-element model of some realistic objects. Backpropagate to compute the jacobian of final state variables with respect to initial state variables. Take a singular value decomposition of the jacobian. Hypothesis: the singular vectors will roughly map to human-recognizable high-level objects in the simulation (i.e. the nonzero elements of any given singular vector should be the positions and momenta of each of the finite elements comprising one object).

Longer version: conceptually, we imagine that there's some small independent Gaussian noise in each of the variables defining the initial conditions of the simulation (i.e. positions and momenta of each finite element). Assuming the dynamics are such that the uncertainty remains small throughout the simulation - i.e. the system is not chaotic - our uncertainty in the final positions is then also Gaussian, found by multiplying the initial distribution by the jacobian matrix. The hypothesis that information-at-a-distance (in this case "distance" = later time) is low-dimensional then basically says that the final distribution (and therefore the jacobian) is approximately low-rank.

In order for this to both work and be interesting, there are some constraints on both the system and on how the simulation is set up. First, "not chaotic" is a pretty big limitation. Second, we want the things-simulated to not just be pure rigid-body objects, since in that case it's pretty obvious that the method will work and it's not particularly interesting. Two potentially-interesting cases to try:

  • Simulation of an elastic object with multiple human-recognizable components, with substantial local damping to avoid small-scale chaos. Cloth or jello or a sticky hand or
Developmental Stages of GPTs

I agree that this is the biggest concern with these models, and the GPT-n series running out of steam wouldn't be a huge relief. It looks likely that we'll have the first human-scale (in terms of parameters) NNs before 2026 - Metaculus, 81% as of 13.08.2020.

Does anybody know of any work that's analysing the rate at which, once the first NN crosses the n-parameter barrier, other architectures are also tried at that scale? If no-one's done it yet, I'll have a look at scraping the data from Papers With Code's databases on e.g. I... (read more)

Preparing for "The Talk" with AI projects

Hey Daniel, don't have time for a proper reply right now but am interested in talking about this at some point soon. I'm currently in UK Civil Service and will be trying to speak to people in their Office for AI at some point soon to get a feel for what's going on there, perhaps plant some seeds of concern. I think some similar things apply.

1Daniel Kokotajlo2y
Sure, I'd be happy to talk. Note that I am nowhere near the best person to talk to about this; there are plenty of people who actually work at an AI project, who actually talk to AI scientists regularly, etc.
Soft takeoff can still lead to decisive strategic advantage

I think this points to the strategic supremacy of relevant infrastructure in these scenarios. From what I remember of the battleship era, having an advantage in design didn't seem to be a particularly large advantage - once a new era was entered, everyone with sufficient infrastructure switched to the new technology and an arms race started from scratch.

This feels similar to the AI scenario, where technology seems likely to spread quickly through a combination of high financial incentive, interconnected social networks, state-sponsored espionage e... (read more)

Torture and Dust Specks and Joy--Oh my! or: Non-Archimedean Utility Functions as Pseudograded Vector Spaces

Apologies if this is not the discussion you wanted, but it's hard to engage with comparability classes without a framework for how their boundaries are even minimally plausible.

Would you say that all types of discomfort are comparable with higher quantities of themselves? Is there always a marginally worse type of discomfort for any given negative experience? So long as both of these are true (and I struggle to deny them), transitivity seems to connect the entire spectrum of negative experience. Do you think there is a way to remove the transitivity of comparability and still have a coherent system? This, to me, would be the core requirement for making dust specks and torture incomparable.


3Louis_Brown3y
I agree that delineating the precise boundaries of comparability classes is a uniquely challenging task. Nonetheless, it does not mean they don't exist - to me your claim feels along the same lines as classical induction "paradoxes" involving classifying sand heaps [https://en.wikipedia.org/wiki/Sorites_paradox]. While it's difficult to define exactly what a sand heap is, we can look at many objects and say with certainty whether or not they are sand heaps, and that's what matters for living in the world and making empirical claims (or building sandcastles anyway).

I suspect it's quite likely that experiences you may be referring to as "higher quantities of themselves" within a single person are in fact qualitatively different and no longer comparable utilities in many cases. Consider the dust specks: they are assumed to be minimally annoying and almost undetectable to the bespeckèd. However, if we even slightly upgrade them so as to cause a noticeable sting in their targeted eye, they appear to reach a whole different level. I'd rather spend my life plagued by barely noticeable specks (assuming they have no interactions) than have one slightly burn my eyeball.
Topological Fixed Point Exercises

I've realised that you've gotta be careful with this method because when you find a trichromatic subtriangle of the original, it won't necessarily have the property of only having points of two colours along the edges, and so may not in fact contain a point that maps to the centre.

This isn't a problem if we just increase the number n by which we divide the whole triangle instead of recursively dividing subtriangles. Unfortunately now we're not reducing the range of co-ords where this fixed point must be, only finding a triad of ar... (read more)

3Chris_Leong4y
Yeah, you're right. That breaks the proof. I don't know how to deal with it yet.
Topological Fixed Point Exercises

Cleanest solution I can find for #8:

Also, if we have a proof for #6 there's a pleasant method for #7 that should work in any dimension:

We take our closed convex set $S$ that has the bounded function $f : S \to S$. We take a triangle $T$ that covers $S$, so that any point in $S$ is also in $T$.

Now we define a new function $g : T \to T$ such that $g(x) = f(p(x))$, where $p$ is the function that maps $x$ to the nearest point in $S$.

By #6 we know that $g$ has a fixed point, since $g$ is continuous. We know that the fixed point of $g$ cannot lie outside $S$ because th... (read more)
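Spelling out the truncated last step (my reconstruction): since $g(T) \subseteq f(S) \subseteq S$, the fixed point $x^* = g(x^*)$ given by #6 must lie in $S$; and for $x^* \in S$ the projection is the identity, $p(x^*) = x^*$, hence $x^* = f(p(x^*)) = f(x^*)$.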

On my approach:

I constructed a large triangle around the convex shape with the center somewhere in the interior. I then projected each point in the convex shape from the center towards the edge of the triangle in a proportional manner. ie. The center stays where it is, the points on the edge of the convex shape are projected to the edge of the triangle and a point 1/x of the distance from the center to the edge of the convex shape is 1/x of the distance from the center to the edge of the triangle.

Topological Fixed Point Exercises

Yeah agreed, in fact I don't think you even need to continually bisect, you can just increase n indefinitely. Iterating becomes more dangerous as you move to higher dimensions because an n dimensional simplex with n+1 colours that has been coloured according to analogous rules doesn't necessarily contain the point that maps to zero.

On the second point, yes I'd been assuming that a bounded function had a bounded gradient, which certainly isn't true for say sin(x^2), the final step needs more work, I like the way you did it in the proof below.

2Czynski4y
Topological Fixed Point Exercises

Here's a messy way that at least doesn't need too much exhaustive search:

First let's separate all of the red nodes into groups so that within each group you can get to any other node in that group only passing through red nodes, but not to red nodes in any other group.

Now, we trace out the paths that surround these groups - they immediately look like the paths from Question 1, so this feels like a good start. More precisely, we draw out the paths such that each edge forms one side of a triangle that has a blue node at its opposite corner. ... (read more)

Topological Fixed Point Exercises

I was able to get at least (I think) close to proving 2 using Sperner's Lemma as follows:

You can map the continuous function f(x) to a path of the kind found in Question 1, of length n+1, by evaluating f(x) at x=0, x=1, and n-1 equally spaced divisions between these two points, and setting a node as blue if f(x) < 0 and green otherwise.

By Sperner's Lemma there is an odd, and therefore non-zero number of b-g vertices. You can then take any b-g pair of nodes as the starting points for a new path and repeat the process. After k iterations you have two v... (read more)
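The construction as code (a sketch; note that with n = 2 it reduces to plain bisection, which is exactly the point raised in the reply below):

```python
# 1-D Sperner argument as an algorithm: colour n+1 nodes by the sign of
# f, find a blue-green edge (an odd number of them must exist), and
# recurse on it. Assumes f continuous with f(0) < 0 < f(1).
def sperner_root(f, n=2, iterations=50):
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        xs = [lo + (hi - lo) * i / n for i in range(n + 1)]
        blue = [f(x) < 0 for x in xs]            # blue = negative
        for i in range(n):
            if blue[i] and not blue[i + 1]:      # a b-g edge
                lo, hi = xs[i], xs[i + 1]
                break
    return (lo + hi) / 2

print(sperner_root(lambda x: x**3 - 0.2))        # ~0.5848 = 0.2 ** (1/3)
```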

I'm having trouble understanding why we can't just fix $n = 2$ in your proof. Then at each iteration we bisect the interval, so we wouldn't be using the "full power" of the 1-D Sperner's lemma (we would just be using something close to the base case).

Also, if we are only given that $f$ is continuous, does it make sense to talk about the gradient?