An ability to refuse to generate theories about a hypothetical world being in a simulation.
I guess the problem with this test is that the kinds of people who could do this tend to be busy, so they probably can't do this with so little notice.
Hmm... It seems much, much harder to catch every single one than to catch 99%.
Regarding the point about most alignment work not really addressing the core issue: I think that a lot of this work could potentially be valuable nonetheless. People can take inspiration from all kinds of things, and I think there is often value in picking something that you can get a grasp on, then using the lessons from that to tackle something more complex. Of course, it's very easy for people to spend all of their time focusing on irrelevant toy problems and never get around to making any progress on the real problem. Plus, there are costs to adding more voices to the conversation, as it can be tricky for people to distinguish the signal from the noise.
If we tell an AI not to invent nanotechnology, not to send anything to protein labs, not to hack into all of the world's computers, not to design weird new quantum particles, not to do 100 of the other most dangerous and weirdest things we can think of, and then ask it to generalize and learn not to do things of that sort
I had the exact same thought. My guess would be that Eliezer might say that, since the AI is maximising, we're screwed if the generalisation function fails to exclude even one action of this sort.
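To make that worry concrete, here is a toy sketch of my own (not anything Eliezer has written). The action names, the scores and the two filters are all made up for illustration, and it assumes that the dangerous actions are exactly the ones that score highest under the AI's objective. Under that assumption, a maximiser picks whatever slips through, so a filter that catches 99 of 100 dangerous actions does no better than no filter at all.

```python
def pick_action(actions, is_blocked):
    """Return the highest-scoring action that the filter does not block."""
    allowed = [a for a in actions if not is_blocked(a)]
    return max(allowed, key=lambda a: a["score"])


# Hypothetical toy data (an assumption for illustration only):
# 100 dangerous actions that all outscore the 100 safe ones.
dangerous = [
    {"name": f"dangerous_{i}", "score": 100 + i, "dangerous": True}
    for i in range(100)
]
safe = [
    {"name": f"safe_{i}", "score": i, "dangerous": False}
    for i in range(100)
]
actions = dangerous + safe


def filter_99(action):
    """Blocks 99 of the 100 dangerous actions; misses dangerous_0."""
    return action["dangerous"] and action["name"] != "dangerous_0"


def filter_100(action):
    """Blocks every dangerous action."""
    return action["dangerous"]


chosen_99 = pick_action(actions, filter_99)
chosen_100 = pick_action(actions, filter_100)

print(chosen_99["name"])   # the one dangerous action the filter missed
print(chosen_100["name"])  # the best safe action
```

If the AI were instead picking randomly among allowed actions, the 99% filter would be nearly as good as the perfect one, which is why the maximising assumption is doing most of the work here.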
I tend to value a longer timeline more than a lot of other people do. I guess I see EA and AI Safety setting up powerful idea machines that get more powerful when they are given more time to gear up. A lot more resources have been invested into EA field-building recently, but we need time for these investments to pay off. At EA London this year, I gained a sense that AI Safety movement building is only now becoming its own thing, and of course it'll take time to iterate to get it right, then time for people to pass through the programs, then time for them to have a career.
I suspect the kind of argument that we need more capabilities to make progress might have been stronger earlier in the game, but now that we already have powerful language models, there's a lot that we can do without needing AI to advance any further.
How large do you expect Conjecture to become? What percentage of people do you expect to be working on the product, and what percentage on safety?
Random idea: A lot of people seem discouraged from doing anything about AI Safety because it seems like such a big, overwhelming problem.
What if there was a competition to encourage people to engage in low-effort actions towards AI safety, such as hosting a dinner for people who are interested, volunteering to run a session on AI safety for their local EA group, answering a couple of questions on the Stampy wiki, offering to proof-read a few people's posts, or offering a few free tutorial sessions to aspiring AI Safety researchers?
I think there's a decent chance I could get this funded (the prizes might be $1000 for the best action and up to 5 prizes of $100 for random actions above a certain bar).
Possible downsides: It would be bad if people reached out to important people or the media without fully thinking things through, but this can be mitigated by excluding those kinds of actions or adding guidelines.
Keen for thoughts or feedback.
Thoughts on the introduction of Goodhart's. Currently, I'm more motivated by trying to make the leaderboard, so maybe that suggests that merely introducing a leaderboard, without actually paying people, would have had much the same effect. Then again, that might just be because I'm not that far off. And if there hadn't been the payment, maybe I wouldn't have ended up in the position where I'm not that far off.
I guess I feel incentivised to post a lot more than I would otherwise, but especially in the comments rather than the posts, since publishing a lot of posts likely suppresses the number of people reading your other posts. This probably isn't a worthwhile tradeoff, given that one post that does really well can easily outweigh 4 or 5 posts that only do okay or ten posts that are meh.
Another thing: downvotes feel a lot more personal when they mean that you miss out on landing on the leaderboard. This leads me to think that having a leaderboard for the long term would likely be negative and create division.
If we have an algorithm that aligns an AI with X values, then we can add human values to get an AI that is aligned with human values.
On the other hand, I agree that it doesn't really make sense to declare an AI safe in the abstract, rather than with respect to, say, human values. (Small counterpoint: safety isn't just about alignment; you also need to avoid bugs. Avoiding bugs can be defined without reference to human values, although it isn't sufficient for safety on its own.)
I suppose this works as a criticism of approaches like quantilisers or impact-minimisation, which attempt abstract safety, although I can't see any reason why it'd imply that it's impossible to write an AI that can be aligned with arbitrary values.