David Manheim




So the code that wires a 100-trillion-synapse human brain is about 7.5 megabytes. Now an adult human contains a lot more information than this.


A minor quibble, which seems to have implications: "There is a consensus that there are roughly about 100 billion neurons total in the human brain. Each of these neurons can have up to 15,000 connections with other neurons via synapses"

My rough understanding is that the number of synapses in babies' brains increases greatly until age 2 or 3, after which many are eliminated or become silent in older children and adults. But this implies that there are a ton of connections, and that most of the conditioning and construction of the structure is environmental, not built into the structure via genetics.

Thinking about this a bit (not a huge amount), I think the specific example "are bugs real" ends up looking interesting in part because the word "bugs" in the prompt has incredibly low likelihood. (As does the following word, "real.")

So the model is conditioning on very low likelihood inputs, which seems like part of the reason for the behavior.

I think what you call grader-optimization is trivially about how a target diverges from the (unmeasured) true goal, which is adversarial Goodhart (as defined in the paper, especially how we defined Campbell's Law, rather than the definition in the LW post).

And the second paper's taxonomy, in failure mode 3, lays out how different forms of adversarial optimization in a multi-agent scenario relate to Goodhart's law, in both goal poisoning and optimization theft cases - and both of these seem relevant to the questions you discussed in terms of grader-optimization.

This relates closely to how to "solve" Goodhart problems in general. Multiple metrics / graders make exploitation more complex, but have other drawbacks. I discussed the different approaches in my paper here, albeit in the realm of social dynamics rather than AI safety.
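The claim that multiple metrics make exploitation more complex can be illustrated with a toy model (an entirely hypothetical setup and numbers, not drawn from either paper): suppose each proxy metric rewards real quality plus a proxy-specific "gaming" axis. Optimizing a single proxy diverts nearly the whole budget into gaming it, while optimizing the minimum over several proxies makes gaming all of them at once too expensive, so more of the budget stays on real quality.

```python
import random

random.seed(0)

# Hypothetical toy Goodhart model: an agent splits a fixed budget between
# latent quality q and per-proxy gaming dimensions g[i]. The unmeasured
# true goal only values q; each proxy also rewards its own gaming axis.

def proxy(i, q, g):
    return q + 2.0 * g[i]  # proxy i can be gamed along its own axis

def best_action(score, n_proxies, trials=5000):
    # crude random search over allocations of a fixed unit budget
    best, best_alloc = -1.0, None
    for _ in range(trials):
        raw = [random.random() for _ in range(n_proxies + 1)]
        total = sum(raw)
        q = raw[0] / total
        g = [r / total for r in raw[1:]]
        s = score(q, g)
        if s > best:
            best, best_alloc = s, (q, g)
    return best_alloc

n = 4
# Optimizing a single proxy pushes almost all budget into gaming it...
q1, _ = best_action(lambda q, g: proxy(0, q, g), n)
# ...while optimizing the *minimum* over several proxies makes gaming
# them all simultaneously costly, so more budget stays on quality.
qmin, _ = best_action(lambda q, g: min(proxy(i, q, g) for i in range(n)), n)

print(f"quality under a single proxy: {q1:.2f}")
print(f"quality under the min of {n} proxies: {qmin:.2f}")
```

This is only a sketch of the mechanism, and it also shows the drawback mentioned above: the min-of-proxies objective is harder to satisfy and more conservative, which is one of the costs of using multiple graders.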

This seems great!

If you are continuing work in this vein, I'd be interested in you looking at how these dynamics relate to different Goodhart failure modes, as we expanded on here. I think that much of the problem relates to specific forms of failure, and that paying attention to those dynamics could be helpful. I also think these failures accelerate in the presence of multiple agents - and I think the framework I pointed to here might be useful.

Is the best way to suggest political and policy strategy, or coordination, really to post it publicly on LessWrong? This seems obviously suboptimal; I'd think you should probably ask for feedback and look into how to promote cooperation privately first.

That said, I think everything you said here is correct on an object level, and worth thinking about.

Strongly agree. Here are three examples of work I've put on arXiv which originated from the forum, which might be helpful as touchstones. The first was cited 7 times in the first year, and 50 more times since. The latter two were posted last year, and have not yet been indexed by Google as having been cited.

As an example of a technical but fairly conceptual paper, there is the Categorizing Goodhart's law paper. I pushed for this to be a paper rather than just a post, and I think that the resulting exposure was very worthwhile. Scott wrote the original post, though we had discussed Goodhart's Law quite a bit in LA, and I had written about it on Ribbonfarm. I think the post took significantly less than 300 hours of specific work, but much more than that in earlier thinking and discussions. The comments and discussion around the post were probably fifty hours, but extending it to cover the items I disagreed with, writing it in LaTeX, making diagrams, and polishing the paper took about another hundred hours between myself, Scott, and others who helped with editing and proofreading.

As an example of a large project with a final report, we commissioned an edited summary report / compilation of our MTAIR sequence. This was at least a thousand hours of total work on the project, probably closer to 3,000, including all the work on the project and writing. The marginal work over the project and posts was a couple thousand dollars in editing, probably amounting to a few dozen hours of work. (We did not move it to LaTeX, and the diagrams were screenshots rather than being done nicely in LaTeX.)

As an example of a conceptual paper that we put on arXiv, here is a model of why people are working on agent foundations, which Issa initially posted on the alignment forum. I pushed for rewriting it and posting it on arXiv. I guesstimate no more than 50 hours of work by Issa for the original post, and perhaps another 100 hours total writing and editing for arXiv. It gets less attention than more technical work, but was also less work. I think that's fine, and it's valuable as a more authoritative reference for the arguments than existed previously.

There's also a poorly researched post on "dynamic safety envelopes" which I put together for other reasons; it was never on the forum, and I didn't realize it was already superseded by Paul Christiano's and others' work on various topics. In retrospect, this should not have been put on arXiv.

Seconding the .tex export, since it's much more useful than just getting a PDF!

That's correct. My point is that measuring goals which are not natural to measure will, in general, run into many more problems with Goodharting and similar misoptimization and overoptimization pressures. Other approaches can be more productive; at a minimum, more care is needed in designing metrics, rather than simply discovering what to measure and how.
