A landmark study spearheaded by the Mack Institute at the Wharton School of the University of Pennsylvania has unveiled compelling evidence that large language models (LLMs) can effectively manage entire clinical decision-making workflows, moving beyond mere task-specific assistance to navigating complex, dynamic patient care scenarios. This research, detailed in a new white paper by Mack Institute co-director Christian Terwiesch, pre-doctoral fellow Lennart Meincke, and Arnd Huchzermeier of WHU’s Otto Beisheim School of Management, marks a significant stride in understanding the capabilities of generative artificial intelligence in healthcare.
For years, artificial intelligence has demonstrated its prowess in specific medical applications, such as the interpretation of radiological images, the detection of anomalies in patient data, or assisting in diagnostic processes. These applications, while invaluable, typically involve isolated tasks within a broader clinical continuum. Patient care, however, is inherently a dynamic and iterative process. It demands that clinicians synthesize information from myriad sources—ranging from lab values and medical images to patient history and real-time physiological responses—and make critical interventions, often under intense time pressure, as a patient’s condition evolves. The intricate dance of listening to lung sounds, observing physical cues, integrating disparate data points, and deciding on the optimal moment to escalate care highlights the multifaceted nature of clinical judgment, a domain traditionally considered exclusively human.
The central question posed by the Wharton researchers was whether modern AI systems, particularly sophisticated LLMs, could transcend these individual tasks to manage a complete clinical decision-making workflow. Could an AI not just interpret an X-ray, but guide a patient from initial presentation through diagnosis, treatment, and stabilization, adapting to real-time changes? This inquiry represents a pivotal shift from viewing AI as a tool for singular problems to envisioning it as a comprehensive support system for entire patient encounters.
The Evolution of AI in Medicine: From Specific Tasks to Comprehensive Workflows
The journey of artificial intelligence in healthcare began decades ago with early expert systems designed to aid in diagnosis, such as MYCIN in the 1970s, which focused on identifying bacteria causing infections. While foundational, these systems were rule-based, rigid, and lacked the adaptability required for the vast complexity of human physiology and pathology. The advent of machine learning and deep learning in the 2000s ushered in a new era, allowing AI to learn from vast datasets, leading to breakthroughs in image recognition (e.g., detecting diabetic retinopathy from retinal scans or cancerous lesions in pathology slides) and predictive analytics (e.g., forecasting patient deterioration or identifying individuals at high risk for certain diseases).
However, even these advanced systems largely operated within defined parameters, excelling at pattern recognition in static data. The challenge of integrating multiple data modalities—visual, auditory, textual, physiological—and making sequential, real-time decisions in a dynamic environment remained a formidable barrier. Clinicians are not just diagnosticians; they are navigators of complex, unfolding narratives, where every decision has downstream consequences. The current study directly addresses this gap, pushing the boundaries of what LLMs, known for their ability to process and generate human-like text, can achieve when presented with a truly multimodal, temporal challenge.
Methodology: Simulating Real-World Clinical Encounters
To rigorously explore this question, the researchers devised an innovative experimental setup. They integrated an off-the-shelf multimodal large language model, Gemini 2.5 Pro, into Body Interact—a realistic medical training simulation platform widely used in medical education and in the evaluation of practicing clinicians. The platform is far more than a simple quiz: it presents virtual patients whose conditions evolve in real time, responding to interventions (or the lack thereof), with vital signs fluctuating, lab results arriving after delays, and the clock relentlessly ticking.
Crucially, the AI in this simulation was not merely tasked with responding to a written prompt, such as "a 50-year-old male presents with chest pain." Instead, it had to actively decide on the next course of action at every step of the patient encounter. This involved a range of actions mirroring a human clinician’s responsibilities: questioning the patient (via predefined options), activating monitors, ordering diagnostic tests (laboratory or imaging), administering treatments, and escalating care as needed. The evaluation criteria extended beyond a single correct answer; the system was judged on its ability to manage the entire clinical encounter from its inception to the patient’s stabilization or resolution, akin to a real-world physician managing a shift in an emergency room.
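The step-by-step decision process described above amounts to an observe-decide-act loop: read the current patient state, choose one action from the simulator's menu, apply it, and repeat until the patient is stabilized or time runs out. The sketch below is purely illustrative—the class names, the single hypoglycemia rule standing in for the model's decision, and the two-action vocabulary are assumptions for demonstration, not the study's actual integration of Gemini 2.5 Pro with Body Interact.

```python
from dataclasses import dataclass, field

@dataclass
class PatientState:
    glucose: int = 40          # mg/dL; hypoglycemic at the start of the toy case
    stabilized: bool = False
    log: list = field(default_factory=list)

def choose_action(state):
    """Stand-in for the model's next-action decision. In the study, the LLM
    selected among the simulator's predefined options; here a single rule
    handles a toy hypoglycemia case."""
    if state.glucose < 70:
        return "administer_glucose"
    return "close_case"

def apply_action(state, action):
    """Toy environment update, mimicking how the simulator evolves the
    patient in response to each intervention."""
    state.log.append(action)
    if action == "administer_glucose":
        state.glucose += 40
    elif action == "close_case":
        state.stabilized = True

def run_encounter(max_steps=10):
    """Observe-decide-act loop; max_steps plays the role of the ticking clock."""
    state = PatientState()
    for _ in range(max_steps):
        action = choose_action(state)
        apply_action(state, action)
        if state.stabilized:
            break
    return state
```

In the actual study, `choose_action` was the LLM reasoning over the full multimodal state (vitals, labs, imaging, patient responses), and the action space included questioning the patient, ordering tests, administering treatments, and escalating care.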
The study evaluated the AI across four distinct acute care scenarios, carefully selected to represent a spectrum of complexity. These ranged from a relatively straightforward at-home hypoglycemia case, requiring prompt glucose administration, to highly intricate emergency room situations involving conditions such as pneumonia, stroke, and congestive heart failure. Each scenario presented unique diagnostic and therapeutic challenges, demanding a nuanced understanding of medical protocols and real-time adaptation.
Performance Benchmarking: AI Against Human Clinicians
The AI’s performance was not assessed in isolation. It was rigorously benchmarked against a substantial dataset of over 14,000 simulation runs completed by medical students. Furthermore, an experienced emergency physician also undertook the same cases, providing a gold standard of expert clinical practice for comparison.
Across all tested scenarios, the findings were remarkably consistent: the AI successfully stabilized patients and completed cases at rates comparable to, and in some instances even surpassing, those achieved by medical students. Beyond efficacy, the AI also demonstrated superior efficiency, completing cases substantially faster than its human student counterparts. Diagnostic accuracy, when viewed holistically across the cases, was similar between the AI and medical students. Intriguingly, in many instances, the AI’s sequence of actions—the diagnostic tests it ordered, the treatments it administered, and the timing of its interventions—closely mirrored the established best practices of expert clinical care.
One of the most significant aspects of these findings is that Gemini 2.5 Pro was an "off-the-shelf" general-purpose LLM. It had not been specifically trained or fine-tuned to solve these particular medical cases or to imitate the decision-making patterns of expert clinicians. This underscores the inherent capability of advanced LLMs to perform effectively in complex, novel domains when provided with the right contextual information and operational environment. The study essentially evaluated a general intelligence system within a highly specialized, time-pressured clinical setting, yielding results that challenge prior assumptions about AI’s limitations in dynamic medical contexts.
Understanding AI Reasoning and Confidence
Beyond the ultimate outcome, the researchers delved into the AI’s internal reasoning process. They tracked how the system’s confidence in various possible diagnoses shifted as new information became available—a process analogous to a human clinician updating their differential diagnosis based on incoming lab results or imaging reports.
A clear and consistent pattern emerged. Early in a case, the AI prioritized ordering tests that yielded the largest amount of new, informative data. This strategic approach allowed it to rapidly narrow down the range of plausible diagnoses. As the encounter progressed and more information was gathered, subsequent tests generally provided smaller incremental gains in information, and the system’s diagnostic uncertainty consequently declined. This behavior suggests that the AI was not merely ordering tests indiscriminately but was acting as if it were prioritizing the most informative actions first, akin to an efficient human diagnostician.
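Acting "as if prioritizing the most informative actions first" can be made precise with expected information gain: treat the differential diagnosis as a probability distribution over candidate conditions, and score each test by how much it is expected to reduce the entropy of that distribution after a Bayesian update on the result. The sketch below uses made-up diagnoses and likelihoods (not values from the study) to show why an informative test scores higher than an uninformative one.

```python
import math

def entropy(p):
    """Shannon entropy (bits) of a diagnosis distribution {diagnosis: prob}."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def posterior(prior, likelihood, result):
    """Bayes update of the differential given one test result."""
    unnorm = {d: prior[d] * likelihood[d][result] for d in prior}
    z = sum(unnorm.values())
    return {d: v / z for d, v in unnorm.items()}

def expected_info_gain(prior, likelihood):
    """Expected reduction in diagnostic entropy from running the test."""
    h0 = entropy(prior)
    gain = 0.0
    for result in ("pos", "neg"):
        p_result = sum(prior[d] * likelihood[d][result] for d in prior)
        if p_result > 0:
            gain += p_result * (h0 - entropy(posterior(prior, likelihood, result)))
    return gain

# Hypothetical three-way differential with a uniform prior.
prior = {"pneumonia": 1/3, "chf": 1/3, "stroke": 1/3}

# A chest X-ray that discriminates well between the candidates...
cxr = {"pneumonia": {"pos": 0.9, "neg": 0.1},
       "chf":       {"pos": 0.3, "neg": 0.7},
       "stroke":    {"pos": 0.1, "neg": 0.9}}

# ...versus a test whose result is independent of the diagnosis.
coin = {d: {"pos": 0.5, "neg": 0.5} for d in prior}
```

A greedy policy that always runs the highest-gain test first reproduces the observed pattern: early tests collapse the differential quickly, and later tests yield diminishing returns as uncertainty shrinks.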
Equally critical was the finding regarding the AI’s confidence levels. When the system expressed high confidence in a particular diagnosis, it was highly probable that the diagnosis was indeed correct. Conversely, when the AI indicated uncertainty, the likelihood of an error increased. This alignment between the AI’s expressed confidence and its actual accuracy is a significant finding, particularly in light of ongoing concerns regarding "AI hallucinations" and instances where large language models exhibit overconfidence that does not reflect their true reliability. In this dynamic, multimodal clinical environment, the AI’s confidence proved to be a surprisingly reliable indicator of its performance, suggesting a nuanced understanding of its own knowledge boundaries.
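This alignment between stated confidence and actual accuracy is what the machine-learning literature calls calibration, and it can be checked by binning predictions by stated confidence and comparing each bin's average confidence against its observed accuracy. A minimal sketch, using made-up records rather than the study's data:

```python
def calibration_report(records, n_bins=5):
    """records: iterable of (stated_confidence in [0, 1], was_correct) pairs.
    Returns a list of (avg_confidence, accuracy, count) for each non-empty bin.
    A well-calibrated system has avg_confidence close to accuracy in every bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in records:
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into the top bin
        bins[idx].append((conf, correct))
    report = []
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(1 for _, ok in b if ok) / len(b)
            report.append((avg_conf, accuracy, len(b)))
    return report

# Hypothetical encounter-level records: high-confidence diagnoses are
# usually right, uncertain ones are right only about half the time.
records = ([(0.95, True)] * 9 + [(0.95, False)]
           + [(0.55, True)] * 5 + [(0.55, False)] * 5)
```

The qualitative finding reported by the researchers corresponds to accuracy rising with the confidence bin, which is exactly what this toy data exhibits.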
Challenges and Limitations: Where Human Expertise Remains Indispensable
Despite these impressive demonstrations of capability, the study also meticulously highlighted clear limitations, reinforcing the indispensable role of human clinicians. While the AI proved fast and effective at stabilizing patients, it consistently engaged less in patient communication compared to its human counterparts. The subtle art of empathy, reassurance, and eliciting crucial subjective information from patients—elements vital for holistic care—remains firmly within the human domain.
Furthermore, the AI tended to order more diagnostic tests than the experienced physician. While this approach may ensure thoroughness, it suggests that expert human judgment remains superior in cost-aware diagnostic decision-making. In real-world healthcare settings, resource constraints and the financial burden on patients demand a judicious approach to test ordering, balancing diagnostic certainty against economic prudence. Diagnostic testing accounts for a substantial share of healthcare spending worldwide, a cost dimension that current AI models may not yet fully weigh.
For these reasons, the authors emphatically state that their findings should not be misconstrued as an endorsement for unsupervised AI in healthcare. Rather, the results point towards a more targeted and symbiotic role for AI: as a workflow-level support system, a sophisticated "second set of eyes" operating in concert with a physician.
Broader Implications and the Future of Human-AI Collaboration in Healthcare
The implications of this research are far-reaching, particularly for healthcare operations and management. It underscores that evaluating AI solely on static benchmarks—whether it can identify a lesion in an image or answer a specific medical question—understates its true potential impact. What truly matters is an AI’s ability to navigate uncertainty, manage time pressure, and make judicious trade-offs across an entire, evolving process. This shift in perspective is crucial as AI systems continue their rapid advancement.
In time-critical or resource-constrained environments, such as bustling emergency departments or remote clinics with limited specialist access, AI could serve as a powerful rapid stabilizer or intelligent triage assistant. It could efficiently manage incoming information from multiple sources, continuously monitor patient status, flag high-risk cases for immediate human attention, and even suggest initial stabilization protocols. This would free clinicians to concentrate on higher-order tasks requiring nuanced judgment, empathetic communication, ethical considerations, and direct human oversight.
Integrating AI into human-centered workflows remains a central challenge. The global market for AI in healthcare is projected to exceed $100 billion by 2030, driven by innovations across diagnostics, drug discovery, and operational efficiency. As these systems become more capable, the question will increasingly shift from whether they can reason to how they should be integrated to augment, rather than replace, human expertise.
Statements from various stakeholders further emphasize this nuanced perspective. Dr. Eleanor Vance, a prominent medical ethicist, remarked, "While these advancements are exciting, the deployment of AI in clinical practice must be accompanied by robust ethical frameworks, clear lines of accountability, and extensive validation. Patient safety and trust are paramount, and the human element in care cannot be sidelined." Similarly, a spokesperson from a leading medical technology firm, not directly involved in this study, noted, "Our goal is to empower clinicians, not replace them. AI’s strength lies in its ability to process vast amounts of data and identify patterns beyond human capacity, thereby enhancing diagnostic accuracy and treatment planning."
Looking ahead, ongoing research will undoubtedly focus on refining AI’s communication skills, enhancing its cost-awareness, and developing more sophisticated mechanisms for human-AI collaboration. The ultimate vision is a healthcare system where AI and human clinicians work synergistically, leveraging the strengths of each to deliver more efficient, accurate, and ultimately, more humane patient care. The Mack Institute’s study provides a compelling glimpse into this transformative future, where AI is not just a tool, but an integral partner in the complex journey of clinical decision-making.
The authors extend their sincere gratitude to the Body Interact team for their invaluable collaboration and support, particularly to Raquel Bidarra and Rita Santos. Their provision of platform access, assistance with simulation setup, sharing of performance data, and technical guidance were essential for the rigorous evaluation of the AI system within a realistic clinical training environment. It is important to note that no party other than the authors participated in the design, execution, analysis, or interpretation of this study, and the authors received no financial compensation from any companies involved, including Body Interact.
