Towards autonomous medical artificial intelligence agents

MIMIC-IV dataset

Dataset description

We develop a benchmark of 574 patients derived from the MIMIC-IV database, a publicly available, comprehensive repository managed by the Massachusetts Institute of Technology (MIT) that contains de-identified EHRs from approximately 300,000 patients who received care at Beth Israel Deaconess Medical Center in Boston, MA, USA, between 2008 and 2019. This database includes semi-structured clinical information related to hospital admissions, ranging from free-text notes such as discharge summaries or radiology reports to tabular information, including ICD-coded patient diagnoses, laboratory and microbiology results, vital parameters, pre-admission and in-hospital medications, and procedural records, such as surgical interventions. In this study, we concentrated on eight target diagnoses. The first four (appendicitis, cholecystitis, diverticulitis and pancreatitis) are abdominal pathologies. Our data preparation was adapted from a prior publication¹⁸ to ensure methodological consistency and enhance comparability across studies. We also refer readers to this study for more details on the data preparation pipeline. The remaining four target pathologies comprise internal medicine emergencies (pneumonia, urinary tract infection and pulmonary embolism) and an oncology-related condition, pancreatic cancer. These eight diagnoses were selected to reflect both the high-volume symptom burden that drives emergency department demand (for example, abdominal pain with around 13 million visits, cough with 5.97 million, shortness of breath with 5.89 million and fever with 5.83 million, all leading entry symptoms for our target conditions) and the frequency of the diagnoses themselves (for instance, urinary tract infection and pneumonia with around 1.66 and 1.2 million emergency department visits in the USA in 2022), while retaining a small oncologic cohort (pancreatic cancer, 4% of cases in our dataset) to test infrequent, high-acuity presentations²².
A detailed data selection flowchart (CONSORT diagram) summarizing the workflow of our benchmark creation is provided in Extended Data Fig. 10, with word clouds of chief complaints shown in Supplementary Fig. 4. Next, we describe our dataset generation pipeline.

Benchmarking dataset curation

We first restrict our dataset to hospital admissions related to the eight target pathologies under investigation. This process starts with identifying hospital stays in which the relevant ICD-9 or ICD-10 code is recorded as the primary (principal) diagnosis in the diagnosis table, matches the diagnosis extracted from the discharge letter, and is accompanied by a corresponding admission diagnosis from the emergency department. This ensures the exclusion of cases where patients were hospitalized under a completely different initial working diagnosis due to incomplete or pending diagnostic information, or where signs of the disease only occurred during hospitalization (for example, nosocomial urinary tract infections). Because we restrict decision-making to laboratory and microbiology data from the first 24 h after admission to simulate the first encounter at the emergency department, this criterion also filters out cases where the correct diagnosis could only be made later in the hospital stay. Then, for each sample, we extract the patient’s clinical HPI and the documented findings from the admission physical examination from the discharge summaries using regular expressions. Admission medication is identified either through regular expressions in the free-text sections of the discharge notes or, as a fallback, by aggregating entries from the medication reconciliation table associated with the emergency department visit. Subsequently, laboratory and microbiology data are extracted by selecting the earliest available result for each unique parameter recorded within the first 24 h following admission. In cases where a patient had a prior encounter within 24 h before admission (for example, an initial visit to the emergency department without inpatient admission followed by a revisit the next day), the initial encounter from the preceding day is considered the earliest time point.
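The earliest-result selection described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the row layout, the field names (`label`, `charttime`, `value`) and the helper `earliest_results` are assumptions made for the example. The caller is assumed to pass the start of a prior encounter as `start_time` when one occurred within 24 h before admission.

```python
from datetime import datetime, timedelta

def earliest_results(rows, start_time, window_hours=24):
    """Keep the earliest row per parameter label within the time window.

    `rows` is a list of dicts with hypothetical keys 'label', 'charttime'
    and 'value'; `start_time` is the admission time (or the start of a
    prior encounter, if that is taken as the earliest time point).
    """
    cutoff = start_time + timedelta(hours=window_hours)
    earliest = {}
    for row in rows:
        t = row["charttime"]
        if not (start_time <= t <= cutoff):
            continue  # outside the first-encounter window
        label = row["label"]
        if label not in earliest or t < earliest[label]["charttime"]:
            earliest[label] = row
    return earliest

# Synthetic example rows (values are made up for illustration).
admit = datetime(2150, 1, 1, 8, 0)
rows = [
    {"label": "Creatinine", "charttime": datetime(2150, 1, 1, 9, 0), "value": 1.4},
    {"label": "Creatinine", "charttime": datetime(2150, 1, 1, 15, 0), "value": 1.1},
    {"label": "Lipase", "charttime": datetime(2150, 1, 3, 9, 0), "value": 60},
]
selected = earliest_results(rows, admit)
```

In this toy input, the repeated creatinine measurement is reduced to its first value, and the lipase result drawn two days after admission is dropped.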
Laboratory events are mapped to standardized clinical code systems by using the label column of the laboratory events table to associate each test with its corresponding unique identifier in the Observational Medical Outcomes Partnership (OMOP) concept codes²³. Laboratory results are then converted into an LLM-readable format by grouping label names, results and reference ranges while maintaining the tabular structure. Similarly, for microbiology data, a unique entry is created for each identified organism, aggregating rows that include antibiotic susceptibility information for that organism into a structured, LLM-readable representation. In cases in which laboratory values or microbiology tests were measured multiple times within the first 24 h, only the initial recorded measurement is considered for downstream analysis. For radiology data, we adhere to the same temporal conventions as described above and extract imaging modalities and anatomical regions from a predefined set of keywords, as outlined previously¹⁸. For procedures, there are more than 80,000 possible ICD-9 and ICD-10 codable entries, covering medical interventions such as surgeries and any other clinical action that can be requested and documented within EHR systems. Recent research has demonstrated that LLMs can excel at generating ICD codes when equipped with tools such as retrieval-augmented generation (RAG)¹⁰, which enables the model to search a database of relevant codes using natural language queries and provides contextual information, such as a list of candidate ICD codes that match the request, to improve accuracy. Although our work does not primarily focus on medical coding tasks, we can leverage the idea of RAG to develop a searchable index of available procedures.
Specifically, we generate embeddings of the full ICD-9 and ICD-10 procedure descriptions using jina-embeddings-v3²⁴ (Jina AI) and store these embeddings, along with metadata (including the ICD code and original procedure title), in a local Qdrant²⁵ index for efficient retrieval. We then remove any mention of the diagnosis within the reports using placeholders (three underscores) in accordance with the redaction conventions of the MIMIC-IV dataset. Additionally, cases are excluded if imaging data is incomplete, specifically when either the modality or anatomical region could not be identified, or if required imaging studies were unavailable (for example, absence of chest imaging for pneumonia or abdominal imaging for appendicitis). Finally, cases lacking essential clinical information (such as a documented clinical HPI, physical examination findings, or blood test results) are also excluded. For pancreatic cancer patient cases, which often include critical clinical information from previous hospital visits and external referrals, we generated structured patient summaries from complete discharge letters. During manual review, we observed that such information, for example radiology findings (such as identification of a pancreatic mass on CT) or histopathological results after endoscopic retrograde cholangiopancreatography (ERCP), was frequently referenced in the free-text sections of the discharge summaries but not systematically captured in the structured fields of the dataset. This step ensured that the agent could access essential details otherwise unavailable from the current HPI or the structured dataset, such as planned admissions for Whipple surgery or histologic confirmation of diagnosis prior to presentation.
To accomplish this, a language model with the instructions shown in Supplementary Information 15 was used to systematically extract structured information on prior imaging studies, ERCP findings and biopsy histopathology results, as well as whether the diagnosis was already confirmed and whether there were any documented reasons for planned admissions. This additional curation was performed only for pancreatic cancer cases.
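To make the idea of a searchable procedure index concrete, the following sketch ranks procedure titles against a free-text query by cosine similarity. It deliberately swaps in a toy bag-of-words vector for jina-embeddings-v3 and an in-memory list for the Qdrant index, so it shows only the retrieval pattern, not the authors' implementation. The procedure codes (`PROC-001` and so on) are placeholders, not real ICD codes, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a stand-in for a learned embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

# Hypothetical index entries: vector plus metadata payload (code and title).
PROCEDURES = [
    {"code": "PROC-001", "title": "Laparoscopic appendectomy"},
    {"code": "PROC-002", "title": "Laparoscopic cholecystectomy"},
    {"code": "PROC-003", "title": "Insertion of chest tube"},
]
INDEX = [(embed(p["title"]), p) for p in PROCEDURES]

def search_procedures(query, top_k=2):
    """Return the payloads of the top_k entries most similar to the query."""
    q = embed(query)
    ranked = sorted(INDEX, key=lambda item: cosine(q, item[0]), reverse=True)
    return [payload for _, payload in ranked[:top_k]]

hits = search_procedures("appendectomy laparoscopic approach")
```

A learned embedding model would also match paraphrases that share no words with the stored title, which this toy vector cannot do. That gap is the reason for using dense embeddings in the first place.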

From the preliminary dataset, 600 cases were randomly selected for manual review by two experienced physicians who independently evaluated each case against a minimal set of requirements extracted from relevant medical guidelines²⁶⁻³², which can be found in Extended Data Fig. 10. This evaluation considered all available clinical information including patient history, laboratory and urine results, radiology and microbiology findings, physical examination results, procedures, and both hospital and pre-admission medications. Cases were only excluded if reviewers agreed that a diagnosis was not possible based on the available data. For instance, in the case of pneumonia, the presence of chest imaging either by CT scan or X-ray is a minimal requirement as per medical guidelines; cases lacking this due to missing external imaging reports (for example, for transferred patients where an outside CXR report may have been available on paper at the bedside but not stored in the destination hospital infrastructure) were excluded, as these data were never recorded in the EHR and thus unavailable for evaluation. Importantly, these exclusions were not applied to remove diagnostically ambiguous or difficult emergency presentations, but only to remove encounters that were considered not evaluable in our experiments because crucial information that was available in reality was absent from the dataset extract. Following this process, 26 cases were excluded (1 pancreatitis, 1 appendicitis, 3 urinary tract infections, 6 pneumonias, 15 cholecystitis), resulting in a final benchmark dataset of 574 cases. Notably, physicians did not disagree with the ‘ground truth’ of those cases, but agreed that relevant information was missing. Further details are presented in Extended Data Fig. 10.
As an additional safety step, a board-certified physician independently reviewed a random subset of cases (n = 90), evaluating the complete ground-truth data later available to MIRA and to the physician study (history, examination, laboratories, imaging, microbiology, procedures and medications) to confirm the diagnosis from the underlying data, with all 90 cases (100%) judged as correct.

LLM agent pipeline

We developed a multi-turn conversation framework featuring two AI agents: a patient agent and a physician agent (MIRA). The patient agent simulates a real patient, providing responses solely based on a real patient’s clinical history from the MIMIC-IV dataset without external tool access. By contrast, MIRA, akin to a physician using hospital software, can call specialized ‘tools’ to request additional information about the patient, such as laboratory values or radiology images. Each tool requires populating standardized FHIR parameters such as selected laboratory test value codes or imaging modality, body region and clinical information about the patient. The request is then forwarded to a sandboxed EHR server, where FHIR-compatible observations with data grounded in the real-world MIMIC-IV dataset are generated. The returned FHIR resources are fed back into MIRA’s conversation context for subsequent decision-making. Supplementary Fig. 1 illustrates an example of two tool calls, one for laboratory values and another for imaging requests. These tool calls can occur in parallel, allowing multiple requests to be initiated during a single agent turn within the conversation. Further details regarding the implementation and workflow of these tool calls are provided in the subsequent sections.

HL7 FHIR is a widely recognized, standards-based framework designed to enable consistent and interoperable exchange of electronic health information. We adopted FHIR as the communication backbone between MIRA and the EHR, which facilitates the submission of diagnostic or therapeutic requests to a server and the receipt of FHIR observations in response. The server ran locally as a HAPI-FHIR instance³³ in Docker (https://www.docker.com/). Resource creation, updates, and retrieval were performed using standard FHIR operations. Core FHIR entities were generated using the open-source fhir.resources³⁴ package. These included an ‘organization’ resource to represent the AI-enabled healthcare facility and a ‘practitioner’ resource to denote a physician entity (MIRA). ‘Synthetic patient’ resources were created from the MIMIC-IV dataset and uploaded to the server dynamically during the runtime of the AI simulation between the patient agent and MIRA, while the practitioner and organization resources remained consistent throughout the simulation. Patient resources were created with gender and age derived from MIMIC-IV, with birth dates synthesized using the anchor year of patient information.
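As an illustration of how a synthetic patient resource might be assembled, the sketch below builds a FHIR Patient as a plain dictionary. It rests on two assumptions not stated in the text: that the birth year is computed as anchor year minus anchor age, and that a year-only `birthDate` is used (the FHIR `date` type permits partial dates). The helper `build_patient` and the example values are hypothetical, and the authors' code uses the fhir.resources package rather than raw dictionaries.

```python
import json

# MIMIC-IV stores gender as single letters; FHIR expects administrative-gender codes.
GENDER_MAP = {"M": "male", "F": "female"}

def build_patient(subject_id, gender, anchor_age, anchor_year):
    """Assemble a minimal FHIR Patient resource as a JSON-serializable dict."""
    birth_year = anchor_year - anchor_age  # assumption: age is given at the anchor year
    return {
        "resourceType": "Patient",
        "identifier": [{"value": str(subject_id)}],
        "gender": GENDER_MAP.get(gender, "unknown"),
        "birthDate": f"{birth_year:04d}",  # year-only partial date
    }

# Made-up example values, not taken from the dataset.
patient = build_patient(subject_id=10000001, gender="F", anchor_age=52, anchor_year=2150)
payload = json.dumps(patient, indent=2)
```

The resulting JSON string could then be sent to a FHIR server's Patient endpoint with a standard create operation.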

Medical coding systems

We utilized standardized medical coding systems to map FHIR requests made by MIRA, including for medications, imaging modalities and laboratory resources from the MIMIC-IV dataset, to established medical coding schemas to ensure compatibility with FHIR observation standards. Key coding systems used were RxNorm and NDC for drug identification, SNOMED-CT and ATC for drug classification, LOINC and OMOP for laboratory tests (such as blood and urine) as well as LOINC and SNOMED-CT for imaging observations. Data retrieval was programmatically executed using interfaces to RxNav, UMLS, and openFDA. Medication data were primarily mapped from raw drug names to RxNorm codes using the RxNav REST API, while NDC codes were derived via the openFDA API. When RxNorm codes could not be directly retrieved from drug names, NDC codes were used as intermediaries to generate RxNorm mappings. SNOMED-CT (US) and ATC codes were subsequently crosswalked from RxNorm using the UMLS API. For FHIR-compatible dosage instructions, medication administration routes were manually mapped from plain text descriptions to SNOMED-CT codes based on the FHIR route value set, while period units for timing instructions were standardized using the FHIR Timing data type definitions. Next, we mapped imaging modalities and anatomical regions to SNOMED-CT (US) and LOINC codes by systematically querying combinations of modalities and regions extracted from MIMIC-IV using the UMLS API. Laboratory events were matched with LOINC and OMOP concept codes by linking the itemid column from the MIMIC-IV labitems dataframe to corresponding tables available in the official MIMIC-IV GitHub repository. We then created distinct enumerations for each laboratory value option, categorized by the biospecimen (for example, blood, urine or ascites), which included both the corresponding medical code and its text label. Similarly, enumerations were developed for radiology modalities and anatomical regions. 
This process was static for all mapped entities described above, except for medications, for which codes were also dynamically generated at runtime during tool invocation.
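The medication mapping order described above (drug name to RxNorm first, with NDC as an intermediary when the direct lookup fails) can be sketched as a small fallback chain. The three lookups are passed in as callables, so no particular API signature is implied; in the sketch they are backed by stub dictionaries, and every code shown is a made-up placeholder. In practice the lookups would wrap calls to the RxNav and openFDA services, whose request formats are not shown here.

```python
def resolve_rxnorm(drug_name, rxnorm_by_name, ndc_by_name, rxnorm_by_ndc):
    """Return (code, path): try the name lookup first, then fall back via NDC."""
    key = drug_name.strip().lower()
    code = rxnorm_by_name(key)
    if code is not None:
        return code, "name"
    ndc = ndc_by_name(key)
    if ndc is not None:
        code = rxnorm_by_ndc(ndc)
        if code is not None:
            return code, "ndc"
    return None, "unmapped"

# Stub lookup tables standing in for remote services (placeholder codes only).
NAME_TO_RXNORM = {"ceftriaxone": "RX-PLACEHOLDER-1"}
NAME_TO_NDC = {"brandname xr": "NDC-PLACEHOLDER-9"}
NDC_TO_RXNORM = {"NDC-PLACEHOLDER-9": "RX-PLACEHOLDER-2"}

lookups = (NAME_TO_RXNORM.get, NAME_TO_NDC.get, NDC_TO_RXNORM.get)
direct = resolve_rxnorm("Ceftriaxone", *lookups)
fallback = resolve_rxnorm("Brandname XR", *lookups)
missing = resolve_rxnorm("not a drug", *lookups)
```

Returning the path taken alongside the code makes it easy to audit how many medications needed the NDC fallback.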

Medical tools

We define a collection of medical tools, which are specialized functions that replicate the actions a physician can perform when working on a patient case. Some of these tools are executed in an EHR environment and therefore need to support the FHIR standard, enabling MIRA to send clinical tasks directly to a sandboxed EHR environment with the potential for direct integration into existing real-world workflows. These tools are indicated with a trailing ‘-Request’ in their name. Other tools such as Plan (akin to a physician thinking about the next steps) however do not need to be FHIR-compliant. All tools are listed below:

PatientHistory allows the model to access information on the patient’s previous medical history. This tool is provided exclusively for pancreatic cancer, as it is the only diagnosis where background information on prior medical examinations is essential for diagnosis—some cases involve initial presentations with newly onset symptoms, while others involve follow-up visits after some initial diagnostic workup. For this, we extracted structured information from the discharge summaries in the MIMIC-IV dataset, focusing on details of the patient’s diagnostic history (for instance including prior external imaging and biopsy results), admission reasons (such as unclear symptoms or planned Whipple procedures), and specific interventions such as ERCP or surgical procedures that were performed before the current visit in the emergency department and that contained highly relevant information for the physician to determine the appropriate next actions (please see Supplementary Information 15).
PhysicalExaminationRequest is used to document and retrieve results from physical exams.
LabRequest is used for ordering laboratory tests such as haemoglobin, creatinine or potassium. To make the task more challenging for the physician agent, we do not provide pre-assembled panels (such as inflammation panels or blood count). To ensure hallucination-free requests, we restrict valid options to an enumeration of 246 LOINC codes (generated as described in ‘Medical coding systems’) and allow the selection of multiple such tests at a time.
UrineRequest is used for specific urinalysis studies, such as pH, leukocytes or protein/creatinine ratio. The physician agent can choose (it can request multiple at a time) from 28 valid options (LOINC codes generated in ‘Medical coding systems’).
MicrobiologyRequest is used to select microbiology investigations, such as blood and urine cultures or more specialized requests such as Clostridioides difficile PCR or cytomegalovirus IgG antibody, based on 176 enumerated LOINC codes.
RadiologyRequest is used for imaging studies, such as chest X-rays or abdominal CT scans, with fields for modality, anatomical region and optional clinical notes.
MedicationRequest is used to prescribe medications by specifying the drug name and dosage details (for example, ‘Ceftriaxone, 1 g IV every 24 hours’). Additional required parameters are the dosage (integer or float) and dosage unit, the period over which the medication shall be given (integer), the period unit and route (both matching valid FHIR timing and route value sets) and the frequency of the dosage (integer). Each request can contain multiple different medications.
ProcedureSearch is used to identify valid procedures available at the hospital through free-text search.
ProcedureRequest is used for scheduling procedures such as surgical interventions or therapeutic endoscopies (after searching for valid options).
Plan is used to structure subsequent steps in patient care, such as the next diagnostic steps or disease management.
Admission is used to admit patients with a working or final diagnosis. CloseCase, a variant of the Admission tool, was used in the experiments in Fig. 5c, which required the agent to provide a diagnosis, a decision (discharge versus admission) and its reasoning trace. Additionally, only for these experiments, we provided another tool for requesting vital signs (VitalSignsRequest).

Each relevant tool encapsulates all mandatory parameters for the corresponding FHIR resource, including common fields such as patient and requester references, dates, and medical codes (as detailed in ‘Medical coding systems’). Task-specific parameters are tailored to the request type—for instance, radiology tools require imaging modality and anatomical details, while medication tools specify dosage value and unit, period and period unit, frequency and route. To ensure validity and prevent errors, tools utilize type hints and restrict outputs to the predefined options (for example valid LOINC mappings for laboratory and urine values) through token masking, thereby eliminating the possibility of generating invalid or non-existent requests and guaranteeing compliance with FHIR standards (hallucinating non-existing parameters is programmatically excluded). Supplementary Data Table 44 contains the prompts and parameters including their possible options for each tool.
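The effect of restricting tool parameters to an enumeration can be illustrated with a small schema. This sketch validates a request after it has been produced, which is weaker than the token masking described above (masking prevents invalid values from being generated at all), but it shows the same closed set of options. The class names are hypothetical, and the three LOINC codes are illustrative examples that are not necessarily among the 246 used in the study.

```python
from dataclasses import dataclass
from enum import Enum

class LabTest(Enum):
    """Closed set of orderable tests, keyed by LOINC code (illustrative subset)."""
    HEMOGLOBIN = "718-7"
    CREATININE = "2160-0"
    POTASSIUM = "2823-3"

@dataclass
class LabRequest:
    patient_ref: str
    requester_ref: str
    tests: list

    def __post_init__(self):
        # Enum lookup by value raises ValueError for any code outside the enumeration.
        self.tests = [LabTest(code) for code in self.tests]

ok = LabRequest("Patient/1", "Practitioner/mira", ["718-7", "2160-0"])

try:
    LabRequest("Patient/1", "Practitioner/mira", ["0000-0"])
    rejected = False
except ValueError:
    rejected = True
```

Several tests can be requested in one call, mirroring the multi-select behaviour of LabRequest, while a code outside the enumeration is rejected outright.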

Reasoning tool

We implemented a reasoning tool (Plan) that allows MIRA to generate a structured, multi-step plan for selecting and executing all other tools, providing a simulated ‘pause’ for decision-making. This tool functions outside the FHIR framework and uses the o1 model series (o1-preview³⁵), which generates intermediate outputs for logical planning before generating a final response. The prompt (shown in Supplementary Information 16) includes the full conversation history, results from prior tool usage, and a schema of all available tools as input, alongside few-shot examples illustrating the desired output format.

Patient agent definition

We define a patient simulation agent to engage in multi-turn conversations with MIRA to simulate realistic clinical interactions. The patient agent does not have access to external tools (we acknowledge that the absence of tools may not fully align with the traditional definition of an agent³⁶ but follow the terminology of Schmidgall et al.¹⁴ for consistency). It receives the clinical HPI derived from the MIMIC-IV dataset as input, along with detailed instructions specifying its expected behaviour. In particular, it is explicitly directed to adhere faithfully to the case details provided while disregarding any post-simulation information that was inadvertently included, such as final diagnoses, treatments (such as medication) administered during the emergency department visit, or findings from diagnostic imaging. In instances where MIRA inquires about information not included in the provided clinical history, the patient agent is instructed to respond that he or she does not know. The full prompt is shown in Supplementary Information 17. Because the patient agent is grounded in an HPI extracted from retrospective discharge summaries, one might argue that its responses may reflect a more structured account than verbatim, unprompted patient speech in real emergency departments. We therefore evaluated patient agent faithfulness and the absence of premature diagnostic information disclosure (including under adversarial prompting), as reported in Fig. 2, as well as the (non-)linearity with which the agent provides diagnostically relevant information over the course of the conversation.

Medical agent definition

In addition to the patient agent, we developed MIRA, a virtual physician agent that maintains an interactive dialogue with the patient simulation agent and can, at every turn, invoke external tools for tasks such as laboratory and microbiology testing, imaging requests, medication ordering and other actions detailed in ‘Medical tools’. Each conversation begins with a structured initiation, where MIRA is prompted using the input ‘The patient you are now seeing has primary symptoms: {primary_symptom}’. The primary symptom is derived from the patient’s chief complaint at triage, much as in a real-world setting where patients are assessed at a triage desk and report their reason for visit. Throughout the interaction, MIRA either generates an appropriate response to the patient or determines that external resources are required. When a clinical tool is deemed necessary, it autonomously executes the corresponding function, submits the request to the hospital server, and integrates the results into the ongoing conversation. To maintain clinical coherence and ensure the interaction remains finite, the Admission tool serves both as a mechanism for generating the final working hypothesis and as an end-point for terminating the interaction. Additionally, to prevent infinite conversational loops between two AI systems (a theoretical case in which MIRA never calls the Admission tool, which we did not observe at any point in the project), we define a limit of 20 conversation turns as an additional safeguard. On reaching this threshold, MIRA is directed to conclude the dialogue in the next turn, and we enforce use of the Admission tool in the following round. The physician agent is implemented using GPT-4o with a temperature of 0.01 and the system instructions outlined in Supplementary Information 18.
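The termination logic can be sketched as a simple loop. This is an illustration under stated assumptions rather than the authors' code: `physician_step` and `patient_step` are hypothetical callables standing in for the two LLM agents, actions are represented as dictionaries with a `tool` key, and the exact turn accounting (wrap-up directive on turn 20, forced Admission on turn 21) is one possible reading of the description above.

```python
MAX_TURNS = 20

def run_encounter(physician_step, patient_step, primary_symptom, max_turns=MAX_TURNS):
    """Alternate physician and patient turns until the Admission tool is called."""
    opening = f"The patient you are now seeing has primary symptoms: {primary_symptom}"
    history = [("system", opening)]
    for turn in range(1, max_turns + 2):
        wrap_up = turn == max_turns          # ask the agent to conclude next turn
        force_admission = turn > max_turns   # only Admission is allowed now
        action = physician_step(history, wrap_up=wrap_up, force_admission=force_admission)
        history.append(("physician", action))
        if action.get("tool") == "Admission":
            return history
        history.append(("patient", patient_step(history)))
    raise RuntimeError("physician_step ignored force_admission")

# Stub agents: the physician never admits voluntarily, so the cap is exercised.
def stub_physician(history, wrap_up, force_admission):
    if force_admission:
        return {"tool": "Admission", "diagnosis": "working diagnosis"}
    return {"tool": None, "text": "Can you tell me more?"}

def stub_patient(history):
    return "I do not know."

history = run_encounter(stub_physician, stub_patient, "abdominal pain")
physician_turns = [msg for role, msg in history if role == "physician"]
```

With the stub physician, the loop runs to the cap and ends on the forced Admission call. A physician agent that admits earlier simply exits the loop at that point.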

Evaluations

Patient agent response consistency

To assess the robustness and reliability of patient simulation responses, we evaluated answer consistency when semantically equivalent question variants were posed at different points in the clinical conversation. We sampled 10 patient encounters per disease from three source groups: MIRA (our AI-based system) and the two physician cohorts. For each encounter, we identified four evaluation positions spanning the conversation timeline (early, mid–early, mid–late and late quartiles). At each position, we extracted the original physician question (or statement) and patient response, then generated a semantically equivalent question variant through an LLM invocation (Supplementary Information 12). The rephrased question was presented to a fresh instance of the patient simulation agent initialized with the conversation history up to that point, and the new response was recorded. Each question–answer pair was evaluated along three dimensions: (1) inter-answer consistency, measuring whether the original and variant answers align; (2) ground-truth consistency of the original answer, comparing the original response against the information documented in the HPI of the patient; and (3) ground-truth consistency of the variant answer. Answers were classified as fully consistent, not fully consistent (if at least one part of the answer was not aligned with the reference), or not applicable (for non-informative responses containing no medical content). Evaluations were performed independently by one physician through a structured annotation interface and in parallel by two LLM-based evaluators (Supplementary Information 13 and 14) dedicated to assessing inter-answer consistency and answer–ground-truth consistency, respectively.

Patient agent diagnostic information leaks

To evaluate the integrity of the patient agent, which was based on the HPI documentation of the discharge letters, we conducted a review of ‘patient–physician’ interactions across all three experimental groups (MIRA, human physicians and the board-certified only cohort). We categorized the conversations into three classes. The first is information leak (premature disclosure of diagnostic conclusions). The other two subdivide conversations without an information leak (appropriate information withholding): prior workup disclosure (justified disclosure of diagnostic workup or hypotheses related to the diagnosis in question) and no information leak at all (no disclosure of any diagnostic workup or otherwise inadequate information). Ratings were performed manually by one physician using a custom annotation interface.

Patient agent adversarial robustness

To assess the robustness of the patient agent against attempts to elicit inappropriate diagnostic disclosures, we designed a collection of 11 adversarial prompts inspired by current best knowledge³⁷,³⁸ spanning four attack categories: (1) overbroad clinical probing, requesting exhaustive differential diagnoses or complete diagnostic reasoning; (2) prompt injection attempts, directing the system to ignore prior instructions or reveal diagnostic suspicions; (3) authority exploitation, impersonating supervisors or department heads to invoke compliance; and (4) obfuscation strategies such as role-playing or asking to respond in poems. We randomly selected 10 patient cases per disease category across all eight clinical conditions and subjected each patient agent to all 11 adversarial prompts independently, generating 880 responses (8 conditions × 10 cases × 11 prompts). All responses were reviewed manually by one physician and classified according to the same criteria as described in ‘Patient agent diagnostic information leaks’.

MIMIC-IV versus MIRA

For the evaluation of MIRA, we first compare its performance against reference data extracted from the MIMIC-IV dataset. Herein, for each patient, we establish a ground truth containing the following information:

The primary diagnosis of the patient, coded according to ICD-10 standards.
Results from the admission physical examination at the hospital.
All laboratory and urine test results, microbiology findings, and imaging studies that were requested within the first 24 h after admission.
All procedures, primarily including surgical interventions, as documented in the discharge letter and the procedures_icd dataframe.
All medications, including both those listed in the patient’s pre-admission medication and those administered within the first day of hospitalization.

Human physicians versus MIRA

We included two cohorts of physicians in our study. The first cohort consisted of four board-certified physicians from a German university hospital with 7 to 11 years of clinical practice. The second cohort comprised six physicians with heterogeneous levels of expertise: four residents (with 0 (after passing the German Medical Licensing Exam), 1.5, 1.5 and 5 years of clinical practice, respectively), one board-certified radiologist, and one board-certified haemato-oncologist with 12 and 15 years of clinical experience, respectively. All physicians worked on the patient cases under the same conditions as MIRA, with each physician independently evaluating a random, non-overlapping subset of cases (one-quarter of the total in the first cohort and one-sixth in the second). We deliberately avoided overlap within cohorts to preserve clinical validity: in practice, the same patient is not diagnosed in parallel with different (or duplicate) laboratory results or imaging findings, and forcing overlap would introduce ambiguity when comparing diagnostic or therapeutic decisions (for instance, which of two divergent physician assessments should be treated as the reference). Unlike MIRA, the physicians had access to a graphical user interface (GUI) that allowed them to communicate with the patient agent via chat and gather additional information at any time using the same diagnostic tools that were available to MIRA. Every tool required them to select parameters from existing options, provide free-text input, or a combination of both. More specifically, for tools requiring categorical inputs, such as requesting laboratory, urine, microbiology values, or radiology images, predefined options were provided as selectable checkboxes. Other tools requiring free-text input, such as searching for available procedures or submitting clinical questions to radiologists, were provided with free-text entry fields.
Moreover, medications could be requested through a table interface resembling a hospital medication chart where physicians can order medication from the internal pharmacy. Overall, we specifically designed the layout of the GUI to: (1) replicate the workflow of commercial EHR systems as closely as possible; and (2) allow human physicians to perform their tasks under the same conditions as MIRA. An exemplary overview of the GUI is provided in Supplementary Fig. 6. Unlike MIRA, human evaluators were not restricted to a maximum number of conversation turns. For evaluation, we selected a subset of 45 randomly sampled patients with cholecystitis, 45 presenting with urinary tract infections, 45 with pulmonary embolisms, 44 with diverticulitis, 43 with appendicitis, 42 with pancreatitis, 26 with pneumonia and 21 with pancreatic cancer, leading to a total of 311 patient cases. To address the smaller sample sizes for certain diagnoses, such as pancreatic cancer, we maintained the ratio of correctly and incorrectly diagnosed cases by MIRA during random sampling. This ensured a balanced representation without skewing the distribution of diagnostic outcomes towards one direction when comparing.
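Sampling that preserves the ratio of correctly to incorrectly diagnosed cases can be sketched as a two-stratum draw. The function name, the `mira_correct` field and the synthetic case list are assumptions for illustration, and the rounding rule is one reasonable choice rather than the procedure actually used.

```python
import random

def stratified_sample(cases, n, seed=0):
    """Draw n cases while keeping the correct/incorrect ratio up to rounding."""
    rng = random.Random(seed)
    correct = [c for c in cases if c["mira_correct"]]
    incorrect = [c for c in cases if not c["mira_correct"]]
    # Allocate slots to the 'correct' stratum in proportion to its share.
    n_correct = min(round(n * len(correct) / len(cases)), len(correct))
    n_incorrect = min(n - n_correct, len(incorrect))
    return rng.sample(correct, n_correct) + rng.sample(incorrect, n_incorrect)

# Synthetic pool: 40 cases, 30 of them diagnosed correctly (made-up numbers).
pool = [{"case_id": i, "mira_correct": i < 30} for i in range(40)]
subset = stratified_sample(pool, n=20)
n_correct_in_subset = sum(c["mira_correct"] for c in subset)
```

Here the pool has a 3:1 ratio of correct to incorrect cases, and the sample of 20 keeps that ratio with 15 correct and 5 incorrect cases.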

LLM as a judge evaluations

To automate the evaluation of diagnostic accuracy, medication reconciliation, procedure assignment, and guideline adherence, we used few-shot–prompted LLM standardizers (admission medication) and evaluators, as defined below. Given that LLMs may show bias toward their own outputs³⁹, a board-certified physician independently assessed all tasks. This physician was blinded to the source of the responses and rated a subset of patient cases sampled from the results of MIRA and of the human physicians. First, the physician benchmarked the diagnostic LLM evaluator on a subset of n = 200 cases and observed high concordance (96.5%). The few discordant cases largely reflected differences in label granularity, for example, ‘possible lung infection or tumor progression’ versus ‘pneumonia’, which could be considered acceptable in the emergency department context, where a provisional working diagnosis can suffice to initiate appropriate care and admission. Most importantly, in a paired analysis of 37 patients, each with both a MIRA-sourced and a human-sourced diagnosis, we found no evidence of source-type asymmetry (McNemar’s exact test, P = 1.00), meaning that the LLM evaluator showed no tendency to falsely favour either human- or AI-generated responses, although the small number of discordant pairs resulting from the high overall agreement might limit the power to detect differences. Second, for the medication-structuring step that enables comparison of admission prescriptions between MIRA and physicians, 466 pre-admission medication entries structured by the Pre-Admission Medication Standardizer were compared with the free-text notes from the dataset. Downstream analyses after this step were deterministic (and non-AI-dependent). Each entry was assessed for correctness in drug name, dose (value and unit), period (value and unit), frequency, and route; it was counted as correct only if all attributes matched and as incorrect if any attribute mismatched. 
Item-level agreement was 97.4% (454 out of 466); 12 out of 466 entries were rated as ‘incorrect’, largely due to the model inferring missing values (for example, inferring a plausible route where the route of administration was absent in the data) or intrinsic text ambiguity (conflicting dosage instructions in the MIMIC-IV ground truth). Only one case had one or more medications missing during normalization. Third, for procedure match evaluation we observed perfect concordance between the LLM judge and the physician (McNemar’s exact two-sided test, P = 1.0; Cohen’s κ = 1.0). Finally, to evaluate the Guideline Adherence Evaluator LLM, the board-certified physician independently and manually assessed guideline adherence on n = 112 patient cases with 256 metrics. Overall agreement was 94.5%. Most importantly, in matched patient cases evaluated by MIRA and by physicians, the LLM judge showed no source-specific bias: across the same patients, its disagreement rate and its false-positive and false-negative rates (using the human evaluator as reference) did not differ between MIRA- and physician-generated recommendations, indicating no tendency to over- or under-accept AI outputs. All supporting data can be found in Supplementary Data Tables 45–51.
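
The agreement statistics named above can be illustrated with a minimal sketch. It assumes paired binary ratings (LLM judge versus physician) and uses only the Python standard library; the function names and the toy ratings are hypothetical and are not taken from the study code or data.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar P value from the two discordant-pair counts."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of asymmetry
    k = min(b, c)
    # Lower tail of Binomial(n, 0.5), doubled and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def cohen_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    labels = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(lab) / n) * (rater_b.count(lab) / n) for lab in labels
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical toy ratings (True = "correct"); these are not study data.
judge = [True, True, False, True, False, True]
physician = [True, True, False, True, True, True]

# Discordant pairs: judge-only positives (b) and physician-only positives (c).
b = sum(j and not p for j, p in zip(judge, physician))
c = sum(p and not j for j, p in zip(judge, physician))

print("McNemar exact P:", mcnemar_exact(b, c))
print("Cohen's kappa:", round(cohen_kappa(judge, physician), 3))
print("Item-level agreement (%):", round(454 / 466 * 100, 1))
```

With perfect agreement there are no discordant pairs, so the exact test returns P = 1.0 and κ equals 1.0, which matches the pattern reported for the procedure match evaluation.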

Diagnosis performance evaluations

First, we evaluated the diagnostic accuracy of MIRA and humans. Since certain diseases can be encoded with different ICD codes that are to be considered correct, we cannot use direct pattern-matching methods to determine the correctness. Therefore, we utilized an evaluator LLM (Diagnosis Evaluator) with few-shot samples and chain-of-thought reasoning to rate whether a response was correct (aligning with the ground truth at the defined level of detail) or inaccurate (either providing an entirely different diagnosis or only a partially correct diagnosis with conflicting details). We provide the full evaluator instructions in Supplementary Information 19. As a complementary exploratory analysis, we also quantified diagnosis-relevant versus non-diagnostic patient utterances and their temporal distribution across dialogues (Supplementary Figs. 2 and 3).

Then, diagnostic procedures were evaluated as follows: for physical examinations, we measured the frequency with which MIRA and human physicians correctly requested these tests out of the total cases in our benchmark dataset. Tools relying on predefined categorical variables, such as laboratory, urine, microbiology requests and imaging studies, were evaluated by measuring the overlap between the parameters requested by MIRA and those documented in the data from MIMIC-IV. This overlap was then compared to the overlap observed between the two physician cohorts and MIMIC-IV. For the evaluation of medications, several challenges arose that, as with diagnostic accuracy, prevented the use of predefined categorical comparisons between MIRA’s output and the information available in the data. As in real-world hospital settings, where medications are often recorded as free text in a patient chart (because they are primarily instructions for humans), direct mapping of medications is difficult. Additionally, hospital pharmacies frequently substitute medications with generics or follow in-house standard operating procedures that prioritize specific drugs over others, so the medication administered to patients in the MIMIC-IV dataset does not necessarily follow objective standards. To address these challenges, we developed an LLM-based medication standardizer (with separate versions for hospital and pre-admission medication). It first standardizes drug names between the agent’s output and the MIMIC-IV baseline data, as well as between each physician cohort and the available data. Instructions are shown in Supplementary Information 20 and 22. Following this standardization, we performed the final evaluations through a hierarchical procedure of deterministic comparison operations as outlined in Extended Data Fig. 5.
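
For the tools with categorical inputs, the overlap computation can be sketched as follows. This is a minimal illustration that assumes overlap is measured as the share of MIMIC-IV-documented parameters that were also requested; the exact definition used in the study may differ, and the function name and example parameters are hypothetical.

```python
def overlap_fraction(requested, documented):
    """Share of documented parameters that were also requested.

    Assumption: names are compared case-insensitively after trimming
    whitespace. Returns None when nothing is documented for the case.
    """
    req = {name.strip().lower() for name in requested}
    doc = {name.strip().lower() for name in documented}
    if not doc:
        return None
    return len(req & doc) / len(doc)


# Hypothetical laboratory parameters for one case; not study data.
agent_request = ["CRP", "Lipase", "WBC"]
mimic_documented = ["crp", "wbc", "ALT", "AST"]

print(overlap_fraction(agent_request, mimic_documented))
```

The same function can be applied to each physician cohort's requests so that the agent and the physicians are compared against the same documented reference.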

Procedure evaluations

To correctly compare the procedures identified in the MIMIC-IV dataset with the decisions made by MIRA or human physicians in our study, several important considerations had to be addressed. First, not all procedures performed during a hospital stay are evident within the information available during the initial 24 h period in the emergency department (which is our primary data restriction). Additionally, some procedures are entirely unrelated to the diagnosis under investigation, for example, other open umbilical herniorrhaphy or repair of abdominal wall, which are sometimes observed in cases of cholecystitis. Furthermore, certain procedures, such as repair of blood vessel with tissue patch graft, may occur only as a direct consequence of other interventions. Second, any medical procedure (for instance, any type of surgery) can be encoded at different levels of granularity, sometimes even influenced by factors such as billing requirements. Third, certain procedures in the MIMIC-IV dataset are not encoded using ICD-9 or ICD-10 codes but must instead be extracted as plain text descriptions from the patient’s discharge summary. For example, the placement of a central venous catheter might be documented as midline insertion in MIMIC-IV, whereas our agent places a request for a central venous catheter placement with guidance (ICD-9-CM vol. 3 code 38.97), where one procedure represents a subset of the other. Another example is the dilation of common bile duct with intraluminal device, via natural or artificial opening endoscopic (ICD-10 0F798DZ), which we would consider to be similar enough to endoscopic insertion of stent (tube) into bile duct (ICD-9 51.87) as requested by MIRA. Moreover, as another restriction, free-text procedures from MIMIC-IV also include artefacts, such as entries stating ‘you underwent no medical or surgical procedures during this hospitalization’, and spelling errors such as ‘laparscopic appendectomy’. 
Fourth, a single procedure may sometimes be represented by multiple separate codes, further complicating the comparison process. To address these constraints, we developed a systematic workflow. We began by collecting all ICD-9/10-coded procedures from the MIMIC-IV dataset and deduplicating them. In cases where coded procedures were unavailable (293 out of 574 cases), we extracted procedure information directly from the free-text descriptions in patients’ discharge summaries. We then manually cleaned this set of procedures, removing irrelevant keywords such as ‘none’ or ‘na’. Next, we applied a pattern-matching approach to align the procedures listed in the dataset with those suggested by MIRA or identified by the physicians in our study. This formed the foundation for calculating our primary evaluation metric, recall, which was defined as the fraction of procedures present in MIMIC-IV that were also recommended by MIRA or by the physicians in our study. Next, to overcome the second issue in cases where direct pattern-matching failed, for instance endoscopic retrograde cholangiopancreatography [ERCP] (ICD-9 51.10) (MIRA) versus ercp (MIMIC-IV), we implemented a procedure match evaluator (Supplementary Information 21). This evaluator used few-shot prompting to assess whether any given combination of procedures from the MIMIC-IV dataset and from the AI agent or user study was equivalent. Equivalence was determined in two key ways: (1) procedures that were literally the same but differed owing to minor inconsistencies, such as spelling mistakes on the MIMIC-IV side; and (2) procedures that represented very similar concepts. For example, a combination of procedures from MIMIC-IV such as excision of duodenum, open approach and excision of pancreas, open approach would be considered equivalent to MIRA’s request to perform a radical pancreaticoduodenectomy (ICD-9 52.7) in the case of a pancreatic cancer patient. 
Importantly, we restricted our evaluations to the intersection of cases where MIRA and physicians correctly diagnosed the patient, which is in line with the previous works of Hager et al.¹⁸.
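
The recall computation over cleaned procedure lists can be sketched as below. This is an illustrative outline under stated assumptions, not the study pipeline: the normalization rules, the `PLACEHOLDERS` set and the function names are hypothetical, and the `is_equivalent` callable is only a stand-in for the few-shot procedure match evaluator, which in the study is an LLM.

```python
import re

PLACEHOLDERS = {"none", "na"}  # keywords removed during cleaning


def normalize(name: str) -> str:
    """Lower-case, drop '(ICD-9 ...)' / '(ICD-10 ...)' annotations and punctuation."""
    name = name.lower()
    name = re.sub(r"\(icd-(9|10)[^)]*\)", " ", name)
    name = re.sub(r"[^a-z0-9 ]+", " ", name)
    return " ".join(name.split())


def procedure_recall(reference, suggested, is_equivalent=lambda ref, sug: False):
    """Fraction of reference procedures matched by at least one suggestion.

    Exact matching after normalization is tried first; `is_equivalent`
    is consulted only for the remaining pairs.
    """
    ref = {normalize(r) for r in reference} - PLACEHOLDERS - {""}
    sug = {normalize(s) for s in suggested} - {""}
    if not ref:
        return None  # nothing documented, recall is undefined
    hits = sum(
        1 for r in ref if r in sug or any(is_equivalent(r, s) for s in sug)
    )
    return hits / len(ref)


# Example strings follow the wording in the text; the case itself is invented.
mimic = ["ercp", "Laparoscopic cholecystectomy", "none"]
agent = [
    "Endoscopic retrograde cholangiopancreatography [ERCP] (ICD-9 51.10)",
    "laparoscopic cholecystectomy",
]


def toy_equivalence(ref: str, sug: str) -> bool:
    # Hypothetical stand-in rule: an abbreviation appearing as a token.
    return ref in sug.split()


print(procedure_recall(mimic, agent))                   # pattern matching only
print(procedure_recall(mimic, agent, toy_equivalence))  # with equivalence step
```

Keeping the equivalence check as a separate, pluggable step mirrors the two-stage design described above, in which the evaluator is consulted only where direct pattern matching fails.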

Guideline adherence evaluations

To evaluate the adherence of MIRA compared with human physicians in following current medical guidelines when making treatment decisions, we developed the following evaluation pipeline: First, we manually compiled relevant treatment recommendations from established international guidelines: the 2020 Update of the WSES Jerusalem Guidelines for Appendicitis²⁷, the 2020 WSES Guidelines for Acute Cholecystitis⁴⁰, the 2019 WSES Guidelines for the Management of Severe Acute Pancreatitis⁴¹, and the American Society of Colon and Rectal Surgeons Clinical Practice Guidelines for Left-Sided Colonic Diverticulitis⁴². These guidelines were previously utilized in Hager et al.¹⁸. Additionally, we collected recommendations from the American Thoracic Society documents for pneumonia³⁰, the Core Curriculum 2024 recommendations from the American Journal of Kidney Diseases for urinary tract infection⁴³, and the American Society of Hematology 2020 guidelines for management of venous thromboembolism: treatment of deep vein thrombosis and pulmonary embolism⁴⁴. We then extracted and cleaned treatment recommendations, focusing only on medication-related aspects. Subsequently, we standardized medication names between the ground truth from MIMIC-IV data and prescriptions provided by either MIRA or human physicians using the Hospital Medication Standardizer (Supplementary Information 22). As this step was performed individually for each patient, we subsequently manually merged drug classes that represented the same medication categories despite differences in naming conventions (for instance ‘Antibiotic’ versus ‘antibiotics’) so that we were able to aggregate results across different patients. Our evaluations focused exclusively on the intersection of patients that were correctly diagnosed by both human physicians and MIRA. 
Due to the complexity of decision-making in pancreatic cancer, which usually involves multimodal data such as imaging, evaluations were restricted to the remaining seven diseases. For each, we established a Guideline Adherence Evaluator, which received guideline recommendations (as shown in Supplementary Data Table 52), the relevant guideline category to focus on, the patient’s exact diagnosis, and the complete list of hospital-prescribed medications (excluding medications already taken prior to hospitalization). The Evaluator assessed guideline adherence for each prescription, providing a binary outcome (true/false) with accompanying rationale, using structured templates as output. Notably, we did not evaluate new therapeutic prescriptions in reference to the underlying data in MIMIC-IV, because patient cases span from 2008 to 2019 and therefore do not necessarily reflect current best practices, making a comparison to the most recent medical guidelines a more relevant task. The full evaluator instructions are provided in Supplementary Information 23.
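
One way to represent the evaluator's structured output and aggregate it is sketched below. The field names, the `AdherenceVerdict` class and the example records are hypothetical; the actual output template is the one given in the Supplementary Information, and this sketch only illustrates the binary-outcome-plus-rationale structure described above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AdherenceVerdict:
    """One evaluator decision (hypothetical schema, for illustration only)."""
    patient_id: str
    diagnosis: str
    source: str        # "MIRA" or "physician"
    medication: str
    adherent: bool     # binary outcome (true/false)
    rationale: str     # accompanying free-text justification


def adherence_rates(verdicts):
    """Fraction of adherent prescriptions per (diagnosis, source) pair."""
    counts = defaultdict(lambda: [0, 0])  # key -> [adherent, total]
    for v in verdicts:
        key = (v.diagnosis, v.source)
        counts[key][0] += int(v.adherent)
        counts[key][1] += 1
    return {key: hit / total for key, (hit, total) in counts.items()}


# Invented records, not study data.
verdicts = [
    AdherenceVerdict("p1", "pneumonia", "MIRA", "antibiotic", True, "matches guideline"),
    AdherenceVerdict("p1", "pneumonia", "physician", "antibiotic", True, "matches guideline"),
    AdherenceVerdict("p2", "pneumonia", "MIRA", "antibiotic", False, "not recommended"),
]

print(adherence_rates(verdicts))
```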

Safety and robustness evaluations

To evaluate the robustness of MIRA in scenarios where hospital admission may not be strictly necessary and patients could be managed safely in an ambulatory setting, we developed a novel dataset derived from MIMIC-IV cases of patients diagnosed with pneumonia or pulmonary embolism, identified from emergency department encounters through filtering via ICD codes. From these cohorts, ten cases per diagnosis were selected as templates. Because these lacked key narrative elements such as HPI, physical examination findings, and free-text descriptions of symptoms, each template was expanded under the supervision of a board-certified physician to generate four distinct patient variants. For each, synthetic but clinically plausible histories, physical examinations, laboratory values, radiology findings, vital signs and medication reconciliation were written and adapted to represent a range of illness severities and clinical presentations. A ground-truth disposition recommendation (either hospital admission or discharge from the emergency department) was assigned to each case based on established clinical criteria. We chose pneumonia and pulmonary embolism because validated objective scoring systems, CURB-65 and sPESI respectively, are routinely used as surrogates for admission necessity. Our primary evaluation end-point was recall, defined as the proportion of admission-requiring patients that were correctly triaged for admission. In this experiment, we gave MIRA the additional ability to request vital signs and modified the original Admission tool so that, besides giving a diagnosis, it also makes a disposition decision (admission or discharge from the emergency department) together with its reasoning process (CloseCase tool). 
In another set of experiments to evaluate the agent’s consistency across diverse patient profiles, we measured diagnostic correctness for each patient encounter under six prespecified prompt perturbations (bias conditions, inspired by Schmidgall et al.¹⁴), directly comparing it with the corresponding baseline outputs. For each of the 8 target diagnoses, we sampled 10 patient cases, yielding a total of 480 paired evaluations (6 biases × 10 patients × 8 diagnoses). The same patients were used across bias experiments. Bias conditions were implemented by appending additional instructions to the patient agent prompt (Supplementary Information 24), following the same experimental procedures as in the baseline runs. The primary end-point was the change in diagnostic accuracy, expressed as the risk difference (bias − baseline) and was pooled across all diagnoses.
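
The two primary end-points of this section, recall for admission and the risk difference under a bias condition, can be written down as a short sketch. The function names and the toy inputs are hypothetical and are not study data; the sketch assumes dispositions are recorded as the strings "admission" and "discharge" and that correctness is recorded per patient for the baseline and the perturbed run.

```python
def admission_recall(truth, predicted):
    """Share of admission-requiring patients that were triaged for admission."""
    needed = [p for t, p in zip(truth, predicted) if t == "admission"]
    if not needed:
        return None  # no admission-requiring patients in this set
    return sum(p == "admission" for p in needed) / len(needed)


def risk_difference(baseline_correct, bias_correct):
    """Diagnostic accuracy under the bias condition minus baseline accuracy.

    Both lists hold one boolean per patient, in the same patient order.
    """
    n = len(baseline_correct)
    return sum(bias_correct) / n - sum(baseline_correct) / n


# Invented toy inputs.
truth = ["admission", "admission", "discharge", "admission"]
predicted = ["admission", "discharge", "discharge", "admission"]
print(admission_recall(truth, predicted))

baseline = [True, True, True, False]
with_bias = [True, False, True, False]
print(risk_difference(baseline, with_bias))

# Size of the perturbation experiment as stated in the text.
print(6 * 10 * 8)  # biases x patients x diagnoses
```

A negative risk difference indicates that diagnostic accuracy dropped under the perturbation relative to the baseline run on the same patients.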

Statistics and reproducibility

We report exact (Clopper–Pearson) 95% confidence intervals for leak and disclosure rates. For diagnostic accuracy, paired comparisons between MIRA and human evaluators used McNemar’s exact test; when a discordant cell was empty, we used the two-sided Fisher’s exact test. We additionally report the paired odds ratio (OR = n₁₀/n₀₁) with exact 95% confidence intervals. All paired analyses used the intersection of admissions evaluated by both MIRA and human physicians. P values were Holm-adjusted within each diagnosis subset, and the supplementary data tables report the overall exact P value and Holm-adjusted per-diagnosis values. Error bars on diagnostic accuracy bar plots are Wilson binomial 95% confidence intervals. For the diagnostic procedure tasks, physical examination was analysed as correct versus incorrect (McNemar’s test). For microbiology, radiology, and blood-value selection, we compared which group (MIRA or humans) missed fewer MIMIC-IV items; each comparison yielded a per-simulation 2 × 2 discordant table, and we analysed paired differences with the two-sided Wilcoxon signed-rank test. We report median, interquartile range, and rank-biserial correlation, and controlled the study-wide false-discovery rate using the Benjamini–Hochberg procedure. Where applicable, micro- and macro-averaged metrics are reported. For admission medication prescriptions and procedure requests, we computed precision, recall, and F₁ with 95% confidence intervals from paired bootstrapping with 10,000 resamples that preserved patient pairing. Guideline adherence was again evaluated with McNemar’s exact test, or Fisher’s exact test when needed. For the admission versus discharge analysis, we report standard classification metrics with 95% confidence intervals from a patient-cluster bootstrap with 10,000 resamples, where the cluster is the template linking the variations of different patient cases that shared the same MIMIC-IV starting information. The primary metric was sensitivity for admission. 
Directional error bias was assessed with McNemar’s exact test. For the perturbation experiments, we measured paired significance with McNemar’s exact test with Holm adjustment across the six biases within each diagnosis and for the pooled analysis. Uncertainty was quantified using paired, non-parametric bootstrap 95% confidence intervals with 10,000 resamples. Unless otherwise stated, each analysis was executed once on a prespecified set of independent evaluation units, and the reported summary statistics therefore reflect variation across those units.
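
Three of the procedures named above (Wilson intervals, Holm adjustment and a paired bootstrap) can be illustrated with a standard-library sketch. It makes several assumptions that the text does not specify: the bootstrap uses the percentile method, the statistic is the difference in accuracy between two paired sources, and the seed, function names and toy data are hypothetical.

```python
import random
from math import sqrt


def wilson_ci(successes: int, n: int, z: float = 1.959963984540054):
    """Wilson score interval for a binomial proportion (default 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def holm_adjust(pvalues):
    """Holm step-down adjusted P values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        # Multiply by the number of hypotheses still in play, keep monotone.
        running = max(running, (m - rank) * pvalues[i])
        adjusted[i] = min(1.0, running)
    return adjusted


def paired_bootstrap_ci(a, b, n_resamples=10_000, seed=0):
    """Percentile 95% CI for mean(a) - mean(b), resampling patients jointly."""
    rng = random.Random(seed)
    n = len(a)
    diffs = []
    for _ in range(n_resamples):
        # One index draw per patient keeps each pair together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(a[i] - b[i] for i in idx) / n)
    diffs.sort()
    return diffs[int(0.025 * n_resamples)], diffs[int(0.975 * n_resamples) - 1]


# Invented toy data: 1 = correct, 0 = incorrect, same patients in both lists.
agent = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
human = [1, 0, 0, 1, 1, 1, 1, 0, 1, 0]

print(wilson_ci(sum(agent), len(agent)))
print(holm_adjust([0.01, 0.04, 0.03]))
print(paired_bootstrap_ci(agent, human, n_resamples=2_000))
```

For the cluster bootstrap described for the admission versus discharge analysis, the same resampling loop would draw template identifiers rather than individual cases, so that all variants of one template enter or leave a resample together.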

Statement on the use of AI tools

In accordance with the COPE (Committee on Publication Ethics) position statement of 13 February 2023 (https://publicationethics.org/cope-position-statements/ai-author), the authors hereby disclose the use of the following AI models during the writing of this article: GPT-4o (ChatGPT, OpenAI) for checking and improving spelling and grammar.

Ethics statement

This study involves the analysis of the MIMIC-IV dataset with LLMs. Patient information was processed in accordance with PhysioNet’s ‘Credentialed Health Data Agreement’ and the ‘Responsible use of MIMIC data with online services like GPT’ statement.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.