AI ALIGNMENT FORUM

Thomas Kwa

Member of technical staff at METR.

Previously: MIRI → interp with Adrià and Jason → METR.

I have signed no contracts or agreements whose existence I cannot mention.

Sequences

Catastrophic Regressional Goodhart

Comments (sorted by newest)

Wei Dai's Shortform
Thomas Kwa · 10d

Agree that your research didn't make this mistake, and MIRI didn't make all the same mistakes as OpenAI. I was responding in the context of Wei Dai's OP about the early AI safety field. At that time, MIRI was absolutely being uncooperative: their research was closed, they didn't trust anyone else to build ASI, and their plan would end in a pivotal act that probably disempowers some world governments and possibly ends up with them taking over the world. Plus, they descended from an org whose goal was to build ASI before Eliezer realized alignment should be the focus. Critch complained as late as 2022 that if there were two copies of MIRI, they wouldn't even cooperate with each other.

It's great that we have the FLI statement now. Maybe if MIRI had put more work into governance, we could have gotten it a year or two earlier, but it took until Hendrycks got involved for the public statements to start.

Wei Dai's Shortform
Thomas Kwa · 11d

> We absolutely do need to "race to build a Friendly AI before someone builds an unFriendly AI". Yes, we should also try to ban Unfriendly AI, but there is no contradiction between the two. Plans are allowed (and even encouraged) to involve multiple parallel efforts and disjunctive paths to success.

Disagree: the fact that there needs to be a Friendly AI before an unFriendly AI doesn't mean building it should be plan A, or that we should race to do it. It's the same mistake OpenAI made when they let their mission drift from "ensure that artificial general intelligence benefits all of humanity" to being the ones who build an AGI that benefits all of humanity.

Making it plan A would mean it deserves more resources than any other path, such as influencing people by various means to build FAI instead of UFAI.

Wei Dai's Shortform
Thomas Kwa · 11d

Also mistakes, from my point of view anyway:

  • Attracting mathy types rather than engineer types, resulting in early MIRI focusing on less relevant subproblems like decision theory, rather than trying lots of mathematical abstractions that might be useful (e.g. maybe there could have been lots of work on causal influence diagrams earlier). I have heard that decision theory was prioritized because of available researchers, not just importance.
  • A cultural focus on solving the full "alignment problem" rather than various other problems Eliezer also thought to be important (e.g. low impact), and lack of a viable roadmap with intermediate steps to aim for. Being bottlenecked on deconfusion is just cope; better research taste would either generate a better plan or realize that certain key steps are waiting for better AIs to experiment on.
  • Focus on slowing down capabilities in the immediate term (e.g. plans to pay AI researchers to keep their work private) rather than investing in safety and building political will for an eventual pause if needed.
Daniel Kokotajlo's Shortform
Thomas Kwa · 2mo

> The Interwebs seem to indicate that that's only if you give it a laser spot to aim at, not with just GPS.

Good catch.

Agree that grenade-sized munitions won't damage buildings. I think the conversation is drifting between FPVs and other kinds of drones, and also between various settings, so I'll just state my beliefs:

  • Fiber FPVs have 40 km range and 2 kg payload (either kamikaze or grenade/mine dropping); they can eventually be countered by a large number of short-range guns if they fly at low altitude. It's not clear to me whether the 40 km range ones need to be larger.
  • Heavy bomber drones can be equipped with fiber (or Starlink for US allies) and carry 15 kg+ payloads, enough to damage buildings and sensitive industrial equipment. They can do this while flying above the range of small guns, so they need dedicated antiaircraft guns to counter.
  • Fixed-wing drones can carry even larger payloads with longer range and higher altitude, but are still pretty slow, except for the ones with jet engines.
  • Drones equipped with GPS will know their position to within ~10 meters, like the GPS-only variant of Excalibur. It seems possible to constrain the payload's horizontal velocity to within 1 m/s on average, and the drop time from 1500 m is about 17 seconds, giving an error of 17 m. The overall error would be sqrt(10^2 + 17^2) ≈ 20 m (see the sketch after this list). If GPS is jammed, it's not obvious they can do the first part, but they can probably still use cameras or something.
  • All of the above are extremely threatening for both organized warfare and terrorism against an opponent without effective cheap air defense.
  • Even with the next evolution of air defense, including radar-acoustic fusion to find threats, the limited reliability of ~all types of existing air defense and the large number of drone configurations make me guess that drones will remain moderately threatening in some form. Given that Hezbollah was previously firing unguided rockets with CEP in the hundreds of meters, some kind of drone that can target with CEP around 20 meters could be more cost-effective for them if they cannot procure thousands of cheap guided missiles. If they could drop six individual grenades on six people from a bomber drone even in the presence of air defense, that would be even more effective, but it seems unlikely.
  • Excalibur is made by the US, which has no incentive to reduce costs, so its $70k price tag is more of a "maximum the army is willing to pay" situation. This is true to some extent with Skyranger, so maybe someone motivated will build smart ammunition that costs $40 per round and make it cost-effective.
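
As a sanity check on the ~20 m figure above, here is a minimal back-of-the-envelope sketch in Python. It ignores drag and wind, and the altitude, GPS error, and residual horizontal velocity are just the assumptions stated in the bullet, not measured values.

```python
import math

# Back-of-the-envelope error budget for a dumb bomb dropped from a drone,
# using the assumptions above: ~10 m GPS position error, ~1 m/s of
# uncorrected horizontal velocity at release, drop from 1500 m, no drag.

g = 9.81                  # gravitational acceleration, m/s^2
altitude_m = 1500.0       # assumed release altitude
gps_error_m = 10.0        # assumed GPS position uncertainty
velocity_error_ms = 1.0   # assumed uncorrected horizontal velocity

drop_time_s = math.sqrt(2 * altitude_m / g)              # ~17.5 s
drift_error_m = velocity_error_ms * drop_time_s          # ~17 m
total_error_m = math.hypot(gps_error_m, drift_error_m)   # ~20 m

print(f"drop time   ~ {drop_time_s:.1f} s")
print(f"drift error ~ {drift_error_m:.0f} m")
print(f"total error ~ {total_error_m:.0f} m")
```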
Daniel Kokotajlo's Shortform
Thomas Kwa · 2mo

Nor would I. In WWII, bombers didn't even know where they were, but we now have GPS, such that Excalibur guided artillery shells can get 1 m CEP. And the US, and possibly China, can use Starlink and other constellations for localization even when GPS is jammed. I would guess 20 m is easily doable with good sensing and dumb bombs, which would at least hit a building.

Daniel Kokotajlo's Shortform
Thomas Kwa · 2mo

IMO it is too soon to tell whether drone defense will hold up to counter-countermeasures.

  • It's already very common for drones to drop grenades, and they can theoretically do so from 1-2 km up if you sacrifice precision.
  • Once sound becomes the primary means of locating drones, I expect UAS operators to do everything they can to mask the acoustic signature, including varying the sound profile through propeller design, making cheaper drones louder than high-end drones, and deploying acoustic decoys.
  • Guns have short range, so these only work to defend targets fairly close to the system. E.g. Ukraine's indigenous Sky Sentinel (12.7 mm caliber) has a range of 1.5 km, and a sufficiently large swarm of FPVs can overwhelm one anyway. For longer ranges, larger calibers are needed, and these have higher costs and lower rates of fire. The 30 mm Skyranger has a range of 3 km, but the ammunition costs hundreds of dollars per round.

I agree that Israel will probably be less affected than larger, poorer countries, but given that drones have probably killed over 200,000 people in Ukraine, even a small percentage of this would be a problem for Israel.

ryan_greenblatt's Shortform
Thomas Kwa · 2mo

The data is pretty low-quality for that graph because the agents we used were inconsistent and Claude 3-level models could barely solve any tasks. Epoch has better data for SWE-bench Verified, which I converted to time horizon here and found to also be doubling every ~4 months. Their elicitation is probably not as good for OpenAI models as for Anthropic models, but both are increasing at similar rates.
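
For anyone who wants to redo this kind of conversion, a minimal sketch of reading a doubling time off a series of time-horizon estimates with a log-linear fit is below; the `observations` values are illustrative placeholders, not Epoch's or METR's actual numbers.

```python
import numpy as np

# Illustrative placeholder data: (months since an arbitrary reference date,
# estimated time horizon in minutes). NOT real benchmark numbers; substitute
# actual estimates before drawing conclusions.
observations = [
    (0, 10),
    (4, 21),
    (8, 39),
    (12, 85),
]

months = np.array([m for m, _ in observations], dtype=float)
log2_horizon = np.log2([h for _, h in observations])

# Fit log2(horizon) = slope * months + intercept; the doubling time in months
# is then 1 / slope.
slope, intercept = np.polyfit(months, log2_horizon, 1)
print(f"doubling time ~ {1 / slope:.1f} months")
```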

Thought Anchors: Which LLM Reasoning Steps Matter?
Thomas Kwa · 3mo

How does your work differ from Forking Paths in Neural Text Generation (Bigelow et al.) from ICLR 2025?

Daniel Kokotajlo's Shortform
Thomas Kwa · 3mo

I talked to the AI Futures team in person and shared roughly these thoughts:

  • Time horizon as measured in the original paper is underspecified in several ways.
  • Time horizon varies by domain and AI companies have multiple types of work. It is not clear if HCAST time horizon will be a constant factor longer than realistic time horizon, but it's a reasonable guess.
  • As I see it, task lengths for time horizon should be something like the average amount of labor spent on each instance of a task by actual companies, and all current methodologies are approximations of this.
  • To convert time horizon to speedup, you would need to estimate the average labor involved in supervising an AI on a task that would take a human X hours and the AI can do with reliability Y, which we currently don't have data on.
  • As I see it, time horizon is in theory superexponential, as it has to go to infinity when we get AGI / superhuman coder (see the sketch after this list). But the current data is not good enough to just fit a superexponential and get a timelines forecast. It could already be superexponential, or it could only go superexponential after time horizon hits 10 years.
  • Cursor and the Claude Code team probably already have data that tracks the speed of generations, plus how long humans spend reading AI code, correcting AI mistakes, and supervising AI in other ways, from which one could construct a better forecast.
  • It is also unclear what speedup an AI with infinite software time horizon would bring to AI R&D, because this would depend on its speed at doing existing tasks, how many useful novel tasks it invents that humans can't do, and ability to interface with non-software parts of the business.
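
To make the exponential-vs-superexponential distinction concrete, here is a toy sketch under assumed parameters (not a forecast): if each successive doubling of the time horizon takes a fixed fraction of the time the previous doubling took, the doubling times form a geometric series, so the horizon diverges at a finite date, whereas a plain exponential never does.

```python
import math

# Toy contrast between an exponential and a "superexponential" time-horizon
# trajectory in which each successive doubling takes a constant factor
# `decay` of the time the previous doubling took. All parameters are
# illustrative assumptions, not fitted values.

initial_horizon_hours = 1.0   # assumed starting time horizon
first_doubling_months = 4.0   # assumed initial doubling time
decay = 0.9                   # assumed: each doubling is 10% faster than the last

def exponential_horizon(months):
    # Constant doubling time: grows without bound but never hits infinity.
    return initial_horizon_hours * 2 ** (months / first_doubling_months)

def superexponential_horizon(months):
    # Doubling times d0, d0*decay, d0*decay^2, ... sum to d0 / (1 - decay),
    # so the horizon diverges at that finite date.
    remaining = 1 - months * (1 - decay) / first_doubling_months
    if remaining <= 0:
        return float("inf")
    n_doublings = math.log(remaining) / math.log(decay)
    return initial_horizon_hours * 2 ** n_doublings

blowup = first_doubling_months / (1 - decay)
print(f"superexponential horizon diverges after ~{blowup:.0f} months")
for months in (12, 24, 36):
    print(f"t={months:2d} mo: exponential ~{exponential_horizon(months):8.1f} h, "
          f"superexponential ~{superexponential_horizon(months):10.1f} h")
```

Pushing `decay` toward 1 moves the divergence date out arbitrarily far, which is one way to see why current data can't yet distinguish the two regimes.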
Daniel Kokotajlo's Shortform
Thomas Kwa · 4mo

One possible way things could go is that models behave like human drug addicts: they don't crave reward until they have the ability to manipulate it easily/directly, but as soon as they do, they lose all their other motivations and values and essentially become misaligned. In this world we might get:

  • Companies put lots of effort into keeping models from nascent reward hacking, because reward hacking behavior will quickly lead to more efficient reward hacking and then to a reward-addicted model.
  • If the model becomes reward-addicted and the company can't control it, they might have to revert to an earlier checkpoint, costing hundreds of millions of dollars.
  • If companies can control reward-addicted models, we might get some labs that try to keep models aligned both for safety and to avoid the need for secure sandboxes, and some that develop misaligned models and invest heavily in secure sandboxes.
  • It's more likely we can tell the difference between aligned and reward-addicted models, because we can see behavioral differences out of distribution.
Posts (sorted by new)

Claude, GPT, and Gemini All Struggle to Evade Monitors · 3mo · 24 karma · 0 comments
The murderous shortcut: a toy model of instrumental convergence · 1y · 16 karma · 0 comments
Goodhart in RL with KL: Appendix · 1y · 5 karma · 0 comments
Catastrophic Goodhart in RL with KL penalty · 1y · 27 karma · 0 comments
Thomas Kwa's research journal · 2y · 27 karma · 0 comments
Catastrophic Regressional Goodhart: Appendix · 2y · 12 karma · 1 comment
When is Goodhart catastrophic? · 2y · 62 karma · 15 comments
Challenge: construct a Gradient Hacker · 3y · 15 karma · 2 comments
Failure modes in a shard theory alignment plan · 3y · 16 karma · 1 comment
Thomas Kwa's Shortform · 6y · 7 karma · 34 comments