A new paper in Nature Medicine reports that leading general-purpose AI models beat specialized medical systems on multiple benchmarks. The comparison tested OpenEvidence and UpToDate Expert AI against GPT, Gemini and Claude across three stages: 500 medical knowledge questions modeled on U.S. licensing exams, 500 tasks measuring alignment with physicians’ judgments, and 100 real questions posed by clinicians in their daily work.
The result was consistent: the general models came out ahead on every measure. On the MedQA test, Gemini reached 97.4% accuracy and GPT 94.2%, while OpenEvidence and UpToDate both scored around 89%. On the real-world clinical questions, physicians also rated the general models higher for clinical correctness, completeness and clarity.
The article argues that the advantage may come from breadth rather than narrow specialization. It says larger models benefit from massive investment, rapid improvement cycles, hundreds of millions of users and constant validation across tasks ranging from coding to scientific explanations. The author adds that forcing models to rely only on guidelines and a fixed body of medical literature did not necessarily improve results, and sometimes made them worse.
The piece cites a separate example in which a general OpenAI system helped solve a long-standing Erdős distance problem by spotting unexpected links between ideas. It concludes that the key question is not whether a system is branded as medical AI, but whether it is more accurate, safer and more useful in the real world. The article is signed by Prof. Ido Wolf, head of the oncology division at Ichilov Medical Center and head of the medical school at Tel Aviv University.