Performance Evaluation of GPT-5, Grok 4, and DeepSeek R1 in Interpreting Complete Blood Count Reports for Hematologic Diseases: Retrospective Comparative Study.
Current frontier AI models show mixed reliability for interpreting blood tests, and this study gives clinicians concrete data to inform safe deployment in hematology practice.
This retrospective comparative study benchmarks three leading publicly available AI language models (GPT-5, Grok 4, DeepSeek R1) against clinical reference standards for complete blood count (CBC) interpretation in hematologic disease. It provides the first head-to-head data for the current frontier generation, directly informing deployment decisions for AI-assisted CBC review in hematology practice.
What the study was
- Study design: Retrospective comparative study
- Population: Patients with hematologic diseases; CBC reports as input
- Category: Diagnostics
- Maturity: Validated
- Journal: Journal of Medical Internet Research
Why it surfaced
First benchmark of GPT-5, Grok 4, and DeepSeek R1 on CBC-based hematology tasks; it directly addresses watchlist items T2 (CBC+ML) and T4 (AI diagnostics). The CBC is among the most frequently ordered lab tests worldwide, so LLM-assisted interpretation at this capability level has immediate deployment relevance. The Journal of Medical Internet Research (J Med Internet Res) is a leading clinical informatics journal.
A plain-language summary of published research — not medical advice. Talk to a clinician about your own care.