
We need a new Turing test to assess AI’s real-world knowledge

Artificial intelligence (AI) models can perform as well as humans on law exams when answering multiple-choice, short-answer and essay questions (A. Blair-Stanek et al. Preprint at SSRN https://doi.org/p89q; 2025), but they struggle to perform real-world legal tasks. Some lawyers have learnt that the hard way: they have been fined for filing AI-generated court briefs that misrepresented principles of law and cited non-existent cases. The same is true in other fields. For example, AI models can pass the gold-standard test in finance — the Chartered Financial Analyst exam — yet score poorly on simple tasks required of entry-level financial analysts (see go.nature.com/42tbrgb).

When an assessment fails to measure the skill it is intended to capture, the result is known as a proxy failure. For example, a lawyer who scored A+ on an exam would be expected to avoid the kinds of error that an AI tool with a similar score might make in a real-world scenario. Better tests are urgently needed to guide the use of AI in complex, high-stakes situations.

One promising idea emerged in March at an Association for the Advancement of Artificial Intelligence workshop in Philadelphia, Pennsylvania: through extensive interaction, a specialist can tell whether an AI system genuinely understands or is merely imitating understanding.

Imagine an AI model attempting to ‘pass’ an interview with an acclaimed legal scholar such as Cass Sunstein at Harvard University in Cambridge, Massachusetts. Sunstein’s expert probing would be a better measure of the model’s legal knowledge than a standardized test or automatically scored benchmark. Passing the ‘Sunstein test’ would require an AI tool to display true legal mastery: the ability to wade through ambiguity and contradiction, not just to answer multiple-choice questions or write an essay.

One might ask: why not simply test an AI model’s legal readiness with task-specific benchmarks, similar to those used in medicine for checking an AI tool’s ability to take notes for a physician? The goal, however, is not to test an AI tool’s ability to perform a specific legal task, or even a long list of them, but to test whether it has general-purpose legal knowledge that it can exercise systematically when performing any task.

I am not suggesting that Sunstein, or any single authority, should be appointed as the arbiter of AI expertise. The goal is to build systems that leading legal specialists broadly agree demonstrate genuine, trustworthy legal knowledge. A ‘robo-lawyer’ would need to cope in a diverse range of interviews with panels of experts — ranging from tax and constitutional lawyers to clerks, traffic officers and legal-aid workers. Such an approach would reduce issues around individual or ideological bias and avoid the trap of AI models merely mimicking one person’s style.

Could a machine reach human levels of expertise, subtlety and ethics? Only specialists can say. But imagine a US Supreme Court justice grilling an AI robo-lawyer in public. That would get everyone’s attention. It would be a spectacle much like multinational technology corporation IBM’s 2011 challenge on the US television quiz programme Jeopardy!. The company pitted its supercomputer Watson against human champions to demonstrate how far machine reasoning and natural-language processing had come.
