Comments

Our findings reveal a robustness gap for LLMs in medical reasoning, demonstrating that evaluating these systems requires looking beyond standard accuracy metrics to assess their true reasoning capabilities.[6](https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2837372#zld250161r6) When forced to reason beyond familiar answer patterns, all models show declines in accuracy, challenging claims of artificial intelligence’s readiness for autonomous clinical deployment.

A system dropping from 80% to 42% accuracy when confronted with a pattern disruption would be unreliable in clinical settings, where novel presentations are common. The results indicate that these systems are more brittle than their benchmark scores suggest.
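
To make the robustness gap concrete, the following is a minimal sketch of how such an evaluation might be run: score a model on the original benchmark items and again on pattern-disrupted variants, then report the absolute and relative decline. The item format, the `query_model` callable, and the function names are illustrative assumptions, not the study's actual evaluation harness.

```python
# Hypothetical sketch of a robustness-gap evaluation. `query_model` stands
# in for any LLM inference call that returns an answer string; items are
# dicts with "question" and "answer" keys (assumed format, not the study's).
from typing import Callable

def accuracy(items: list[dict], query_model: Callable[[str], str]) -> float:
    """Fraction of items where the model's answer matches the key."""
    correct = sum(
        1 for item in items
        if query_model(item["question"]).strip() == item["answer"]
    )
    return correct / len(items)

def robustness_gap(
    original: list[dict],
    disrupted: list[dict],  # same items with familiar answer patterns removed
    query_model: Callable[[str], str],
) -> dict[str, float]:
    """Absolute and relative accuracy decline under pattern disruption."""
    acc_orig = accuracy(original, query_model)
    acc_disr = accuracy(disrupted, query_model)
    return {
        "original": acc_orig,
        "disrupted": acc_disr,
        "absolute_drop": acc_orig - acc_disr,
        "relative_drop": (acc_orig - acc_disr) / acc_orig,
    }

# With the figures cited above, 0.80 -> 0.42 is an absolute drop of
# 38 percentage points and a relative decline of 47.5%.
```

Reporting the relative drop alongside raw accuracy is what separates this kind of stress test from a standard leaderboard score: two models with identical benchmark accuracy can differ sharply in how much they lose under disruption.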