I'm an admin of LessWrong. Here are a few things about me.
Further detail on this: Cotra has more recently updated at least 5x against her original 2020 model in the direction of faster timelines.
Greenblatt writes:
Here are my predictions for this outcome:
- 25th percentile: 2 years (Jan 2027)
- 50th percentile: 5 years (Jan 2030)
Cotra replies:
My timelines are now roughly similar on the object level (maybe a year slower for 25th and 1-2 years slower for 50th).
This means 25th percentile for 2028 and 50th percentile for 2031-2.
The original 2020 model assigns 5.23% by 2028, and 9.13% and 10.64% by 2031 and 2032 respectively. In each case the update is a factor of ~5x.
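Spelling out that arithmetic, using the numbers just quoted:

$$\frac{25\%}{5.23\%} \approx 4.8, \qquad \frac{50\%}{9.13\%} \approx 5.5, \qquad \frac{50\%}{10.64\%} \approx 4.7$$

i.e. each updated percentile puts roughly 5x the probability on that date that the original model did.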
However, the original model predicted the date by which it would be affordable to train a transformative AI model. This is a leading variable on such a model actually being built and trained, which pushes the date back by some further number of years, so view the 5x as bounding, not pinpointing, the AI timelines update Cotra has made.
Third, we gave the model a secret “scratchpad”—a workspace where it could record its step-by-step reasoning. We told the model to use the scratchpad to reason about what it should do (as far as the model was aware, the scratchpad was invisible to users and to Anthropic). This allowed us to observe whether the model planned to fake alignment, and how it would go about doing so.
This feels amusingly like tricking a child. "Remember kiddo, you can reason out loud about where you're going to hide and I won't hear it. Now let's play hide and seek!"
I don't know how to quickly convey why I find this point so helpful, but it's a useful pointer to a key problem, the post is quite short, and I hope someone else positively votes on it. +4.
Or you guys could find a 1-2 hour window to show up and live-chat in a LW dialogue, then publish the results :-)
Curated. I thought this was a valuable list of areas, most of which I haven't thought that much about, and I've certainly never seen them brought together in one place before, which I think is itself a handy pointer to the sort of work that needs doing.
You explicitly assume this stuff away, but I believe that under this setup the subagents would be incentivized to murder each other before the button is pressed (to get rid of that annoying veto).
I also note that if one agent becomes way, way smarter than the other, this balance may not work out.
Even if it works, I don't see how to set up the utility functions such that humans aren't disempowered. That's a complicated term!
Overall a very interesting idea.
Something a little different: Today I turn 28. If you might be open to doing something nice for me for my birthday, I would like to request the gift of data. I have made a 2-4 min anonymous survey about me, and if you have a distinct sense of me as a person (even just from reading my LW posts/comments) I would greatly appreciate you filling it out and letting me know how you see me!
Here's the survey.
It's an anonymous survey where you rate me on lots of attributes like "anxious", "honorable", "wise" and more. All multiple-choice. Two years ago I also shared a birthday survey amongst people who know me and ~70 people filled it out, and I learned a lot from it. I am very excited to see how the perception of me amongst the people I know has *changed*, and also to find out how people on LessWrong see me, so the core of this survey is ~20 of the same attributes.
In return for your kind gift, if you complete it, you get to see the aggregate ratings of me from last time!
This survey helps me understand how people see me and recognize my blindspots, and I'm very grateful to anyone who takes a few mins to complete it. Two people have completed it already; it took them 2 mins and 4 mins respectively. (There are many further optional questions, but it says clearly when the main bit is complete.)
I of course intend to publish the (aggregate) data in a LW post and talk about what I've learned from it :-)