Peer-reviewed research

Research that shapes how we build.

Peer-reviewed publications at the intersection of large language models and clinical medicine.

11 publications
3 years of output
Nature · JAMIA · ACL · PLOS
2026 · 2 publications
Case Study

Implémentation d'un chatbot dans le dossier patient informatisé [Implementing a chatbot in the electronic patient record]

M Griot, A Irrthum, J Vanderdonckt, D Yuksel

Actes de la journée d'étude sur l'utilisation des LLM à l'hôpital · 47, 2026

A case study documenting our in-house deployment of an LLM-powered clinical assistant inside a university-hospital EHR — from pilot to full hospital production. We describe the system's technical design and the observed usage patterns: rapid, sustained adoption by clinical staff, with primary use cases centered on medical-information retrieval and patient-record summarization.

Thesis

A Methodology for Developing and Integrating Large Language Models into Electronic Health Records to Support Clinical Workflows

M Griot

Université Catholique de Louvain · Doctoral Thesis, 2026

A doctoral thesis establishing an end-to-end methodology for developing and integrating large language models into clinical workflows with patient safety as a primary constraint. It brings together the internistai-7b-v0.2 model, the Glianorex and MetaMedQA benchmarks, and the in-house EHR chatbot deployed to more than 1,000 clinicians. Throughout, the methodology emphasizes clinician involvement and adaptation to the local medical context.

2025 · 7 publications
Journal · Highlighted

Large language models lack essential metacognition for reliable medical reasoning

M Griot, C Hemptinne, J Vanderdonckt, D Yuksel

Nature Communications · 16 (1), 642, 2025

Large language models achieve expert-level accuracy on medical board exams — but we show they lack a critical ability for clinical practice: metacognition. We introduce MetaMedQA, a benchmark that adds confidence scoring and unknown-answer recall to medical multiple-choice questions, and evaluate twelve models across those dimensions. All models fail to recognize the limits of their own knowledge and remain confident even when no correct option is available, revealing a dangerous disconnect between perceived and actual clinical capabilities.

Conference · SAC Highlight

Pattern recognition or medical knowledge? The problem with multiple-choice questions in medicine

M Griot, J Vanderdonckt, D Yuksel, C Hemptinne

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL) · 2025

Medical LLMs are routinely benchmarked on USMLE-style multiple-choice questions — but are those scores actually measuring medical knowledge? We build a fictional benchmark centered on an imaginary organ, the Glianorex, so that genuine reasoning can be cleanly separated from memorization. Despite fully fictional content, models average 64% while physicians score 27%; ablation and interpretability analyses show models relying on shallow cues, test-taking heuristics, and hallucinated reasoning rather than clinical understanding.

Journal · Highlighted

Implementation of large language models in electronic health records

M Griot, J Vanderdonckt, D Yuksel

PLOS Digital Health · 4 (12), e0001141, 2025

We deployed an on-premises, GDPR-compliant LLM assistant (Qwen3-235B with retrieval-augmented generation) directly inside the Epic EHR at a European university hospital. A one-month pilot with 28 physicians saw 64% daily use, and the subsequent hospital-wide rollout reached 1,028 clinicians and 14,910 conversations over five months. These results demonstrate that large-scale clinical LLM integration is technically feasible and sustains real usage when embedded in workflows and governed by strong privacy safeguards.

Conference

Physician in the Loop Design of Interactive Agents

M Griot, J Vanderdonckt, D Yuksel, C Hemptinne

Engineering Interactive Computer Systems (EICS 2024 International Workshops) · 2025

Technology development for medicine often happens without meaningful physician input. We show that a single one-hour interactive meeting gives physicians enough understanding of interactive agents to propose relevant, feasible ideas for their own clinical work. This is a low-cost, practical way to close the expertise gap between computer scientists and clinicians.

Journal

A hybrid deployment model for generative artificial intelligence in hospitals

M Griot, C Hemptinne, J Vanderdonckt, D Yuksel

Machine Learning: Health · 1 (1), 013001, 2025

Vendor-provided generative AI tools rarely account for the evaluation, adaptation, and oversight requirements of clinical practice. We propose a hybrid framework that uses vendor models for non-medical applications while running hospital-managed infrastructure for clinical use cases — balancing innovation with patient safety, mitigating biases, ensuring regulatory compliance, and supporting long-term operational stability.

Journal

A patient-in-the-loop approach to artificial intelligence in medicine

MF Griot, GA Walker

JAMA Network Open · 8 (6), e2514460, 2025

Most discussions of clinical AI stop at the human-in-the-loop paradigm, but we argue the patient also belongs in the loop. Drawing on evidence that patients prefer less accurate but explainable systems, we make the case that AI should augment the patient–physician relationship rather than replace it — with both patients and clinicians shaping AI development so that systems align with what people actually want from healthcare technology.

Book Chapter

La régulation de l'utilisation de l'intelligence artificielle en milieu hospitalier [Regulating the use of artificial intelligence in hospital settings]

A Gobert, M Rappe, M Griot

Droit hospitalier: décodage juridique au départ des réalités hospitalières · 2025

A legal and regulatory analysis of artificial intelligence in hospital settings. We examine the current regulatory landscape — GDPR, medical-device regulation, and liability — as it applies to clinical AI, and propose governance frameworks that hospitals can use to deploy these systems responsibly while meeting their legal obligations.

2024 · 2 publications
Journal · Highlighted

Impact of high-quality, mixed-domain data on the performance of medical language models

M Griot, C Hemptinne, J Vanderdonckt, D Yuksel

Journal of the American Medical Informatics Association (JAMIA) · 31 (9), 1875-1883, 2024

How should medical language models be trained? We show that models trained on a curated mix of high-quality domain-specific and general-domain data substantially outperform models trained on larger but less focused corpora (P < .001). Our 7B-parameter Med5 model reaches 60.5% on MedQA, up from the prior best of 49.3% at comparable scale. It becomes the first model of its size to pass the USMLE and retains general-purpose competence, demonstrating that data quality, not parameter count, drives clinical utility.

Dataset

MetaMedQA benchmark code

M Griot, C Hemptinne, J Vanderdonckt, D Yuksel

Zenodo · 2024

Open-source release of the MetaMedQA benchmark: code, data, and evaluation harness for measuring metacognitive abilities of large language models on medical question-answering. Enables independent reproduction and extension of our Nature Communications findings on confidence calibration and unknown-answer recall.