AI ALIGNMENT FORUM

Thomas Kwa

Member of technical staff at METR.

Previously: Vivek Hebbar's team at MIRI → Adrià Garriga-Alonso on various empirical alignment projects → METR.

I have signed no contracts or agreements whose existence I cannot mention.

Sequences

Catastrophic Regressional Goodhart

Comments

Daniel Kokotajlo's Shortform
Thomas Kwa · 10d

The Interwebs seem to indicate that that's only if you give it a laser spot to aim at, not with just GPS.

Good catch.

Agreed that grenade-sized munitions won't damage buildings. I think the conversation is drifting between FPVs and other kinds of drones, and also between various settings, so I'll just state my beliefs:

  • Fiber FPVs have 40 km range and a 2 kg payload (either kamikaze or grenade/mine dropping); they can eventually be countered by a large number of short-range guns if they fly at low altitude. It's not clear to me whether the 40 km range ones need to be larger.
  • Heavy bomber drones can be equipped with fiber (or Starlink for US allies) and carry a 15 kg+ payload, enough to damage buildings and sensitive industrial equipment. They can do this while flying above the range of small guns, so countering them requires dedicated antiaircraft guns.
  • Fixed-wing drones can carry even larger payloads with longer range and higher altitude, but they are still pretty slow, except for the ones with jet engines.
  • Drones equipped with GPS will know their position to within ~10 meters, like the GPS-only variant of Excalibur. It seems possible to constrain the payload's horizontal velocity to within 1 m/s on average, and the drop time from 1500 m is about 17 seconds, giving an error of 17 m. The overall error would be sqrt(10^2 + 17^2) ≈ 20 m (see the sketch after this list). If GPS is jammed, it's not obvious they can do the first part, but they can probably still use cameras or something.
  • All of the above are extremely threatening for both organized warfare and terrorism against an opponent without effective cheap air defense.
  • Even with the next evolution of air defense, including radar-acoustic fusion to find threats, the limited reliability of ~all existing types of air defense and the large number of drone configurations make me guess that drones will remain moderately threatening in some form. Given that Hezbollah was previously firing unguided rockets with CEP in the hundreds of meters, some kind of drone that can target with a CEP around 20 meters could be more cost-effective for them if they cannot procure thousands of cheap guided missiles. If they could drop six individual grenades on six people from a bomber drone even in the presence of air defense, that would be even more effective, but it seems unlikely.
  • Excalibur is made by the US, which has no incentive to reduce costs, so its $70k price tag is more of a "maximum the army is willing to pay" situation. This is true to some extent with Skyranger too, so maybe someone motivated will build smart ammunition that costs $40 per round and make it cost-effective.
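
A minimal sketch of the drop-error arithmetic in the GPS bullet above; the release height, GPS error, and velocity error are the rough assumed numbers from that bullet, and drag is ignored:

```python
import math

# Rough bomb-drop error estimate; all inputs are assumptions from the bullet above.
g = 9.81               # m/s^2
drop_height = 1500.0   # m, release altitude above the target
gps_error = 10.0       # m, assumed GPS position uncertainty
velocity_error = 1.0   # m/s, assumed residual horizontal velocity at release

drop_time = math.sqrt(2 * drop_height / g)        # free-fall time, ignoring drag
drift_error = velocity_error * drop_time          # horizontal drift accumulated during the fall
total_error = math.hypot(gps_error, drift_error)  # combine independent errors in quadrature

print(f"drop time ≈ {drop_time:.0f} s, drift ≈ {drift_error:.0f} m, total ≈ {total_error:.0f} m")
# drop time ≈ 17 s, drift ≈ 17 m, total ≈ 20 m
```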
Daniel Kokotajlo's Shortform
Thomas Kwa · 10d

Nor would I. In WWII, bombers didn't even know where they were, but now we have GPS, such that Excalibur guided artillery shells can get 1 m CEP. And the US, and possibly China, can use Starlink and other constellations for localization even when GPS is jammed. I would guess 20 m is easily doable with good sensing and dumb bombs, which would at least hit a building.

Daniel Kokotajlo's Shortform
Thomas Kwa · 11d

IMO it is too soon to tell whether drone defense will hold up to counter-countermeasures.

  • It's already very common for drones to drop grenades, and they can theoretically do so from 1-2 km up if you sacrifice precision.
  • Once sound becomes the primary means of locating drones, I expect UAS operators to do everything they can to mask the acoustic signature, including varying the sound profile through propeller design, making cheaper drones louder than high-end drones, and deploying acoustic decoys.
  • Guns have short range, so these only work to defend targets fairly close to the system. E.g. Ukraine's indigenous Sky Sentinel (12.7mm caliber) has a range of 1.5 km, and a sufficiently large swarm of FPVs can overwhelm one anyway. For longer ranges, larger calibers are needed, and these have higher costs and lower rates of fire. Skyranger 30mm has a range of 3 km, but the ammunition costs hundreds of dollars per round (see the toy cost sketch after this list).
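
As a toy illustration of the cost-exchange point in the last bullet, here is a sketch where every number (drone cost, rounds per intercept, cost per round) is an assumption for illustration rather than a sourced figure:

```python
# Toy cost-exchange comparison for gun-based drone defense.
# Every number below is an assumption for illustration, not a sourced figure.
drone_cost = 500           # $, cheap FPV
rounds_per_intercept = 30  # assumed rounds fired per successful intercept

for gun, cost_per_round in [("12.7mm, Sky Sentinel-class", 5),
                            ("30mm airburst, Skyranger-class", 300)]:
    ammo_cost = rounds_per_intercept * cost_per_round
    print(f"{gun}: ~${ammo_cost} of ammo per intercept "
          f"({ammo_cost / drone_cost:.1f}x the drone's cost)")
```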

I agree that Israel will probably be less affected than larger, poorer countries, but given that drones have probably killed over 200,000 people in Ukraine, even a small percentage of this would be a problem for Israel.

ryan_greenblatt's Shortform
Thomas Kwa · 23d

The data for that graph is pretty low-quality because the agents we used were inconsistent and Claude 3-level models could barely solve any tasks. Epoch has better data for SWE-bench Verified, which I converted to time horizon here and found to also be doubling every 4 months or so. Their elicitation is probably not as good for OpenAI models as for Anthropic models, but both are increasing at similar rates.
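
A minimal sketch of how a doubling time like this can be estimated from a handful of (date, time horizon) points; the numbers below are made up for illustration, not Epoch's or METR's data:

```python
import numpy as np

# Hypothetical data: months since an arbitrary start date, and time horizon in minutes.
months = np.array([0, 4, 8, 12, 16])
horizon_minutes = np.array([8, 15, 33, 60, 130])

# Fit log2(horizon) linearly in time; the slope is doublings per month.
slope, intercept = np.polyfit(months, np.log2(horizon_minutes), 1)
print(f"estimated doubling time ≈ {1 / slope:.1f} months")  # ≈ 4 months for this fake data
```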

Thought Anchors: Which LLM Reasoning Steps Matter?
Thomas Kwa · 1mo

How does your work differ from Forking Paths in Neural Text Generation (Bigelow et al.) from ICLR 2025?

Daniel Kokotajlo's Shortform
Thomas Kwa · 1mo

I talked to the AI Futures team in person and shared roughly these thoughts:

  • Time horizon as measured in the original paper is underspecified in several ways.
  • Time horizon varies by domain, and AI companies have multiple types of work. It is not clear whether HCAST time horizon will be a constant factor longer than realistic time horizon, but it's a reasonable guess.
  • As I see it, task lengths for time horizon should be something like the average amount of labor actual companies spend on each instance of a task, and all current methodologies are approximations of this.
  • To convert time horizon to speedup, you would need to estimate the average labor involved in supervising an AI on a task that would take a human X hours and that the AI can do with reliability Y, which we currently don't have data on (a toy version is sketched after this list).
  • As I see it, time horizon is in theory superexponential, since it has to go to infinity when we get AGI / a superhuman coder. But the current data is not good enough to just fit a superexponential and get a timelines forecast. It could already be superexponential, or it could only go superexponential after time horizon hits 10 years.
  • Cursor and the Claude Code team probably already have data that tracks the speed of generations, plus how long humans spend reading AI code, correcting AI mistakes, and supervising AI in other ways, from which one could construct a better forecast.
  • It is also unclear what speedup an AI with infinite software time horizon would bring to AI R&D, because this would depend on its speed at doing existing tasks, how many useful novel tasks it invents that humans can't do, and its ability to interface with non-software parts of the business.
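
Here is a toy version of the time-horizon-to-speedup conversion mentioned in the bullet on supervision labor above; the functional form and all parameters are assumptions for illustration, not the AI Futures team's model or METR data:

```python
def toy_speedup(human_hours: float, reliability: float, supervision_hours: float) -> float:
    """Toy model: the human reviews each AI attempt, and on failure does the task
    themselves from scratch; ignores partial credit, retries, and compute costs."""
    expected_human_labor = supervision_hours + (1 - reliability) * human_hours
    return human_hours / expected_human_labor

# Example: a 4-hour task the AI completes with 80% reliability and 0.5 h of human review.
print(f"{toy_speedup(4.0, 0.8, 0.5):.1f}x speedup")  # ≈ 3.1x under these made-up numbers
```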
Daniel Kokotajlo's Shortform
Thomas Kwa · 2mo

One possible way things could go is that models behave like human drug addicts: they don't crave reward until they have the ability to manipulate it easily/directly, but as soon as they do, they lose all their other motivations and values and essentially become misaligned. In this world we might get:

  • Companies put lots of effort into keeping models from nascent reward hacking, because reward hacking behavior will quickly lead to more efficient reward hacking and a reward-addicted model.
  • If the model becomes reward-addicted and the company can't control it, they might have to revert to an earlier checkpoint, costing hundreds of millions of dollars.
  • If companies can control reward-addicted models, we might get some labs that try to keep models aligned for safety and to avoid the need for secure sandboxes, and some that develop misaligned models and invest heavily in secure sandboxes.
  • It's more likely that we can tell the difference between aligned and reward-addicted models, because we can see behavioral differences out of distribution.
Distillation Robustifies Unlearning
Thomas Kwa · 2mo

One setting that might be useful to study is the one in Grebe et al., which I just saw at the ICML MUGen workshop. The idea is to insert a backdoor that points to the forget set; they study diffusion models, but it should be applicable to LLMs too. It would be neat if UNDO or some variant could be shown to be robust to this; I think it would depend on how much noising is needed to remove backdoors, which I'm not familiar with.

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
Thomas Kwa · 2mo

Monitoring seems crucial for the next safety case. My guess is that in a couple of years we'll lose capability-based safety cases and need to rely on a good understanding of the question "how difficult are the tasks frontier agents can do covertly, without a human or AI monitor detecting their reasoning?".

Having said this, it may be that in a year or two, we'll be able to translate neuralese or have a good idea of how much one can train against a monitor before CoT stops being transparent, so the current paradigm of faithful CoT is certainly not the only hope for monitoring.

When is Goodhart catastrophic?
Thomas Kwa · 3mo

Cassidy Laidlaw published a great paper at ICLR 2025 that proved (their Theorem 5.1) that the gap between proxy reward and true reward is bounded given a minimum proxy-true correlation and a maximum chi-squared divergence from the reference policy. Basically, chi-squared divergence works where KL divergence doesn't.

Using this in practice for alignment is still pretty restrictive: the fact that the new policy can't be exponentially more likely to achieve any state than the reference policy means this will probably only be useful in cases where the reference policy is already intelligent/capable.
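
For intuition on why chi-squared divergence gives this kind of control, here is the generic change-of-measure inequality (not the paper's exact Theorem 5.1): for any policy $\pi$ absolutely continuous with respect to $\pi_{\text{ref}}$ and any $f$ with finite variance under $\pi_{\text{ref}}$,

$$\left|\mathbb{E}_{\pi}[f] - \mathbb{E}_{\pi_{\text{ref}}}[f]\right| = \left|\mathbb{E}_{\pi_{\text{ref}}}\!\left[\left(\tfrac{d\pi}{d\pi_{\text{ref}}} - 1\right)\left(f - \mathbb{E}_{\pi_{\text{ref}}}[f]\right)\right]\right| \le \sqrt{\chi^2(\pi \,\|\, \pi_{\text{ref}})}\,\sqrt{\mathrm{Var}_{\pi_{\text{ref}}}(f)},$$

by Cauchy-Schwarz, using $\chi^2(\pi \,\|\, \pi_{\text{ref}}) = \mathbb{E}_{\pi_{\text{ref}}}\big[(d\pi/d\pi_{\text{ref}} - 1)^2\big]$. Taking $f$ to be the proxy-minus-true error shows the flavor of the result: a chi-squared constraint caps how much optimization can inflate the proxy over the true reward, which is exactly the kind of bound that fails under a KL constraint when the error is heavy-tailed.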

Posts

7 · Thomas Kwa's Shortform · 6y · 34
24 · Claude, GPT, and Gemini All Struggle to Evade Monitors · 2mo · 0
16 · The murderous shortcut: a toy model of instrumental convergence · 1y · 0
5 · Goodhart in RL with KL: Appendix · 1y · 0
27 · Catastrophic Goodhart in RL with KL penalty · 1y · 0
27 · Thomas Kwa's research journal · 2y · 0
12 · Catastrophic Regressional Goodhart: Appendix · 2y · 1
62 · When is Goodhart catastrophic? · 2y · 15
15 · Challenge: construct a Gradient Hacker · 3y · 2
16 · Failure modes in a shard theory alignment plan · 3y · 1