Train clinical AI to reason like a team of doctors

Visitors to an interactive AI exhibition at the German Museum of Technology in Berlin use virtual-reality glasses to view an image of the brain. Credit: Jens Kalaene/dpa/Alamy

Following a surge of excitement after the launch of the artificial-intelligence (AI) chatbot ChatGPT in November 2022, governments worldwide have been striving to craft policies that will foster AI development while ensuring the technology remains safe and trustworthy. In February, several provisions of the European Union’s Artificial Intelligence Act — the world’s first comprehensive AI regulation — took effect, prohibiting the deployment of certain applications, such as automated systems that claim to predict crime or infer emotions from facial features.

Most AI systems won’t face an outright ban, but will instead be regulated using a risk-based scale, from high to low. Fierce debates are expected over the act’s classification of ‘high-risk’ systems, which will have the strictest oversight. Clearer guidance from the EU will begin emerging in August, but many AI-driven clinical solutions are likely to attract scrutiny owing to the potential harm associated with biased or faulty predictions in a medical setting.

Clinical AI — if deployed with caution — could improve health-care access and outcomes by streamlining hospital management processes (such as patient scheduling and doctors’ note-taking), supporting diagnostics (such as identifying abnormalities in X-rays) and tailoring treatment plans to individual patients. But these benefits come with risks — for instance, the decisions of an AI-driven system cannot always be easily explained, limiting the scope for real-time human oversight.

This matters, because such oversight is explicitly mandated under the act. High-risk systems are required to be transparent and designed so that an overseer can understand their limitations and decide when they should be used (see go.nature.com/3dtgh4x).

By default, compliance will be evaluated using a set of harmonized AI standards, but these are still under development. (Meeting these standards will not be mandatory, but is expected to be the preferred way for most organizations to demonstrate compliance.) However, as yet, there are few established technological ways to fulfil these forthcoming legal requirements.

Here, we propose that new approaches to AI development — based on the standard practices of multidisciplinary medical teams, which communicate across disciplinary boundaries using broad, shared concepts — could support oversight. This dynamic offers a useful blueprint for the next generation of health-focused AI systems that are trusted by health professionals and meet the EU’s regulatory expectations.

Collaborating with AI

Clinical decisions, particularly those concerning the management of people with complex conditions, typically take various sources of information into account — from electronic health records and lifestyle factors to blood tests, radiology scans and pathology results. Clinical training, by contrast, is highly specialized, and few individuals can accurately interpret multiple types of specialist medical data (such as both radiology and pathology). Treatment of individuals with complex conditions, such as cancer, is therefore typically managed through multidisciplinary team meetings (known as tumour boards in the United States) at which all of the relevant clinical fields are represented.

Because multidisciplinary team meetings involve clinicians from different specialities, they do not focus on the raw characteristics of each data type; that knowledge is not shared by the full team. Instead, team members communicate with reference to intermediate ‘concepts’, which are widely understood. For example, when justifying a proposed treatment course for a tumour, team members are likely to refer to aspects of the disease, such as the tumour site, the cancer stage or grade and the presence of specific patterns of molecular markers. They will also discuss patient-associated features, including age, the presence of other diseases or conditions, body mass index and frailty.

These concepts, which represent interpretable, high-level summaries of the raw data, are the building blocks of human reasoning — the language of clinical debate. They also typically feature in national clinical guidelines for selecting treatments for patients.

Notably, this process of debate using the language of shared concepts is designed to facilitate transparency and collective oversight in a way that parallels the intentions of the EU AI Act. For clinical AI to comply with the act and gain the trust of clinicians, we think that it should mirror these established clinical decision-making processes. Clinical AI — much like clinicians in multidisciplinary teams — should make use of well-defined concepts to justify predictions, instead of just indicating their likelihood.

Explainability crisis

There are two typical approaches to explainable AI1 — a system that explains its decision-making process. One involves designing the model so it has built-in rules, ensuring transparency from the start. For example, a tool for detecting pneumonia from chest X-rays could assess lung opacity, assign a severity score and classify the case on the basis of predefined thresholds, making its reasoning clear to physicians. The second approach involves analysing the model’s decision after it has been made (‘post hoc’). This can be done through techniques such as saliency mapping, which highlights the regions of the X-ray that influenced the model’s prediction.
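
To make the first, rules-based approach concrete, here is a minimal sketch in the spirit of the pneumonia example; the opacity score, thresholds and labels are illustrative assumptions rather than values from any validated tool.

```python
# Minimal sketch of an interpretable-by-design, rules-based classifier.
# The severity score and thresholds below are hypothetical placeholders.

def classify_chest_xray(lung_opacity_score: float) -> str:
    """Map a lung-opacity severity score (0-1, hypothetical) to a label
    using predefined thresholds, so the reasoning is visible to clinicians."""
    if lung_opacity_score >= 0.7:
        return "pneumonia: high likelihood"
    if lung_opacity_score >= 0.4:
        return "pneumonia: indeterminate; refer for human review"
    return "pneumonia: low likelihood"

print(classify_chest_xray(0.82))  # -> "pneumonia: high likelihood"
```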

However, both approaches have serious limitations2. To see why, consider an AI tool that has been trained to help dermatologists to decide whether a mole on the skin is benign or malignant. For each new patient, a post-hoc explainability approach might highlight pixels in the image of the mole that were most important for the model’s prediction. This can identify reasoning that is obviously incorrect — for instance, by highlighting pixels in the image that are not related to the mole (such as pen marks or other annotations by clinicians)3.
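
To illustrate the post-hoc route, the sketch below computes a simple gradient-based saliency map with PyTorch. The model and image are placeholders standing in for a trained skin-lesion classifier, and deployed systems typically use more sophisticated attribution methods.

```python
# Minimal sketch of post-hoc saliency mapping using plain input gradients.
import torch
import torch.nn as nn

model = nn.Sequential(                      # stand-in for a trained lesion classifier
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2)
)
model.eval()

image = torch.rand(1, 3, 224, 224, requires_grad=True)   # placeholder mole image
logits = model(image)
logits[0, logits.argmax()].backward()                     # gradient of the top-scoring class

# Pixel-wise importance: large values mark the regions that most influenced the prediction.
saliency = image.grad.abs().max(dim=1).values             # shape (1, 224, 224)
```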

The European Parliament in Brussels adopted the Artificial Intelligence Act last March. Credit: Geert Vanden Wijngaert/AP Photo/Alamy

When the mole is highlighted, however, it might be difficult2,4 for an overseeing clinician — even a highly experienced one — to know whether the set of highlighted pixels is clinically meaningful, or simply spuriously associated with diagnosis. In this case, use of the AI tool might place an extra cognitive burden on the clinician.

A rules-based design, however, constrains an AI model’s learning to conform rigidly to known principles or causal mechanisms. Yet the tasks for which AI is most likely to be clinically useful do not always conform to simple decision-making processes, or might involve causal mechanisms that combine in inherently complex or counter-intuitive ways. Such rules-based models will not perform well in precisely the cases in which a physician might need the most assistance.

In contrast to these approaches, when a dermatologist explains their diagnosis to a colleague or patient, they tend not to speak about pixels or causal structures. Instead, they make use of easily understood high-level concepts, such as mole asymmetry, border irregularity and colour, to support their diagnosis. Clinicians using AI tools that present such high-level concepts have reported increased trust in the tools’ recommendations5.

In recent years, approaches to explainable AI have been developed that could encode such conceptual reasoning and help to support group decisions. Concept bottleneck models (CBMs) are a promising example6. These are trained not only to learn outcomes of interest (such as prognosis or treatment course), but also to include important intermediate concepts (such as tumour stage or grade) that are meaningful to human overseers. These models can thereby provide both an overall prediction and a set of understandable concepts, learnt from the data, that justify model recommendations and support debate among decision makers.
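
A minimal sketch of this idea, written in PyTorch with made-up concept names and dimensions, might look as follows: the network first predicts the named concepts, then makes its final prediction from those concepts alone, so both can be inspected.

```python
# Minimal sketch of a concept bottleneck model (CBM). Concept names,
# feature sizes and outcome classes are illustrative assumptions.
import torch
import torch.nn as nn

CONCEPTS = ["tumour_stage_high", "grade_high", "marker_positive"]  # hypothetical

class ConceptBottleneckModel(nn.Module):
    def __init__(self, n_features: int, n_concepts: int, n_outcomes: int):
        super().__init__()
        # Raw features -> interpretable concepts
        self.concept_net = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(),
                                         nn.Linear(32, n_concepts))
        # Concepts -> outcome: the final prediction uses only the concepts
        self.outcome_net = nn.Linear(n_concepts, n_outcomes)

    def forward(self, x):
        concepts = torch.sigmoid(self.concept_net(x))   # per-concept probabilities
        outcome_logits = self.outcome_net(concepts)
        return concepts, outcome_logits                 # both are exposed for oversight

model = ConceptBottleneckModel(n_features=64, n_concepts=len(CONCEPTS), n_outcomes=2)
concepts, outcome = model(torch.rand(1, 64))
```

During training, a loss on the concept predictions (against clinician-provided labels) is combined with a loss on the outcome, which is what keeps the bottleneck faithful to the named concepts.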

This kind of explainable AI could be particularly useful when addressing complex problems that require harmonization of distinct data types. Such models are also well suited to regulatory compliance under the EU AI Act, because they provide transparency in a way that is specifically designed to facilitate human oversight. For example, if a CBM incorrectly assigns an important clinical concept to a given patient (such as predicting an incorrect tumour stage), then the overseeing clinical team immediately knows not to rely on the AI prediction.

Moreover, because of how CBMs are trained, such concept-level mistakes can also immediately be corrected by the clinical team, allowing the model to ‘receive help’7 and revise its overall prediction and justification with the aid of clinician input. Indeed, CBMs can be trained to expect such human interventions and use them to improve model performance over time.
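
Continuing the hypothetical sketch above, a test-time intervention can be as simple as overwriting a mispredicted concept with the clinician’s value and recomputing the outcome from the corrected concepts; the chosen concept and value are again illustrative.

```python
# Minimal sketch of a clinician intervention on the CBM sketched above.
import torch

x = torch.rand(1, 64)                       # placeholder patient features
concepts, _ = model(x)                      # model's own concept estimates

corrected = concepts.detach().clone()
corrected[0, CONCEPTS.index("tumour_stage_high")] = 1.0   # clinician supplies the true value

revised_outcome = model.outcome_net(corrected)            # prediction with the correction applied
```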
