Despite the promise of AI-human teamwork, new research reveals a surprising limitation in decision-making tasks—yet hints at a breakthrough for creative fields where AI can enhance human ingenuity.
Research: When combinations of humans and AI are useful: A systematic review and meta-analysis.
In an article recently published in the journal Nature Human Behaviour, researchers at the Massachusetts Institute of Technology (MIT) systematically reviewed and analyzed 106 experimental studies to evaluate when human–artificial intelligence (AI) combinations outperformed either humans or AI alone.
Findings revealed that human–AI collaborations generally performed worse than the best solo performer, particularly in decision-making tasks, but showed greater benefits in creative tasks. The study highlighted variability in human–AI performance, suggesting potential for optimizing these collaborations in specific contexts. However, it also emphasized that while human-AI synergy (where the combined system outperforms both humans and AI alone) was rare, human augmentation (where the combined system outperforms humans alone) was observed with a medium to large positive effect size.
Background
Human-AI collaboration has become increasingly prevalent across diverse fields, from healthcare and finance to daily activities like travel and shopping, as the unique strengths of human cognition and AI’s computational power create potential for innovative solutions. Prior studies have shown that while combining human creativity and intuition with AI’s analytical precision can enhance decision-making, these systems do not consistently outperform humans or AI alone. Factors such as communication barriers, trust issues, and coordination challenges can limit the success of collaborations.
To address these challenges, the researchers conducted a systematic literature review and meta-analysis of 106 experiments published between 2020 and 2023, analyzing 370 effect sizes. By focusing on "human–AI synergy" and "human augmentation," the study revealed that while AI could improve individual human performance, it did not consistently generate true synergy. The analysis identified moderating factors such as task type and relative human versus AI performance as critical, providing insights into more effective future human–AI system design.
Research Design and Analytical Methodology
This meta-analysis adhered to Kitchenham's systematic review guidelines and the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) standards. It focused on studies that examined human-AI collaboration in task performance. Eligible studies had to report quantitative outcomes for human-only, AI-only, and human-AI systems.
Literature searches were conducted in databases like the Association for Computing Machinery (ACM) Digital Library and Web of Science, targeting works from 2020 to 2023 to capture current AI applications. The search utilized a carefully developed string to encompass human, AI, collaboration, and experimental components, with additional forward and backward searches for completeness.
Data collection included recording averages and standard deviations for human, AI, and collaborative performances and calculating standard deviations when only confidence intervals or errors were provided. In cases where data were incomplete, the authors contacted the original study authors or used WebPlotDigitizer to convert graph data into usable values.
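The conversion from a reported standard error or confidence interval back to a standard deviation can be sketched as follows. This is a minimal illustration, assuming the interval is a normal-approximation interval around a mean; the article does not state which formula the authors used, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def sd_from_se(se: float, n: int) -> float:
    """Standard deviation implied by a standard error of the mean."""
    return se * math.sqrt(n)


def sd_from_ci(lower: float, upper: float, n: int, level: float = 0.95) -> float:
    """Standard deviation implied by a confidence interval around a mean.

    Assumption: the interval was built as mean +/- z * SE (normal approximation).
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    se = (upper - lower) / (2 * z)
    return sd_from_se(se, n)
```

For small samples a t-based interval would be more appropriate, which would change the multiplier used in place of `z`.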
For analysis, Hedges’ g was calculated as the effect size, comparing human-AI collaboration outcomes against baseline performances. A random-effects model was employed to address variability across tasks, experimental designs, and participant types. Due to the dependency among effect sizes in certain studies, a three-level meta-analytic model was used, which accounted for variance across levels (within-experiment, between-experiment, and across studies) and applied robust variance estimation for standard errors and statistical tests.
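To make the effect-size step concrete, the sketch below computes Hedges' g for two groups and pools several effects with a basic random-effects model. It is a simplified illustration only: it uses the DerSimonian-Laird estimator for between-study variance, not the three-level model with robust variance estimation described above, and all function names are hypothetical.

```python
import math


def hedges_g(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Return (g, variance) comparing group A (e.g. human-AI) with group B (baseline)."""
    df = n_a + n_b - 2
    pooled_sd = math.sqrt(((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / df)
    d = (mean_a - mean_b) / pooled_sd          # Cohen's d
    j = 1 - 3 / (4 * df - 1)                   # small-sample correction factor
    var_d = (n_a + n_b) / (n_a * n_b) + d**2 / (2 * (n_a + n_b))
    return j * d, j**2 * var_d


def pool_random_effects(effects):
    """Pool a list of (g, variance) pairs; returns (pooled_g, standard_error, tau2)."""
    w = [1 / v for _, v in effects]
    sum_w = sum(w)
    fixed = sum(wi * g for wi, (g, _) in zip(w, effects)) / sum_w
    q = sum(wi * (g - fixed) ** 2 for wi, (g, _) in zip(w, effects))
    c = sum_w - sum(wi**2 for wi in w) / sum_w
    k = len(effects)
    # DerSimonian-Laird estimate of between-study variance, floored at zero
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_star = [1 / (v + tau2) for _, v in effects]
    pooled = sum(ws * g for ws, (g, _) in zip(w_star, effects)) / sum(w_star)
    return pooled, math.sqrt(1 / sum(w_star)), tau2
```

A positive g here means group A scored higher than group B, so the sign convention depends on which system is passed first and on whether higher scores are better for the task.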
Bias assessments, including funnel plots and regression tests, indicated minimal publication bias regarding human–AI synergy but potential bias in favor of studies showing human augmentation, where human–AI systems outperformed humans alone. Sensitivity analyses, robustness checks, and leave-one-out tests supported the reliability of the findings across diverse experimental contexts.
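A leave-one-out test re-estimates the pooled effect once for each effect size, omitting that one each time, to see whether any single result drives the conclusion. The sketch below illustrates the idea with a plain inverse-variance weighted mean, which is a simplification and not the model used in the study; the function names and example values are hypothetical.

```python
def inverse_variance_mean(effects):
    """Weighted mean of (g, variance) pairs, weighting each by 1 / variance."""
    weights = [1 / v for _, v in effects]
    return sum(w * g for w, (g, _) in zip(weights, effects)) / sum(weights)


def leave_one_out(effects):
    """Pooled estimate with each effect removed in turn (needs at least two effects)."""
    return [
        inverse_variance_mean(effects[:i] + effects[i + 1:])
        for i in range(len(effects))
    ]


# Hypothetical effect sizes, each with the same variance
example = [(0.2, 0.1), (0.4, 0.1), (0.6, 0.1)]
estimates = leave_one_out(example)  # approximately [0.5, 0.4, 0.3]
```

If the estimates stay close to the full-sample value and keep the same sign, the pooled result is not dependent on any one effect size.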
Study Outcomes and Insights
In a comprehensive review of 5,126 papers, 74 studies met the inclusion criteria, encompassing 106 unique experiments and 370 effect sizes that evaluated human-AI collaboration on task performance. Using a three-level meta-analytic model, the analysis revealed that human-AI systems generally underperformed compared to the best-performing solo system (human or AI alone), with a small but significant negative effect size (g = −0.23). However, when compared solely to human performance, human–AI systems significantly enhanced performance, showing a medium to large positive effect (g = 0.64).
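The two headline numbers differ only in the baseline used for comparison. The short sketch below shows the distinction for a single experiment, assuming a metric where higher is better; the scores are made up for illustration and the function name is hypothetical.

```python
def compare_to_baselines(human: float, ai: float, combined: float) -> dict:
    """Check the combined system against both baselines (higher score = better)."""
    best_alone = max(human, ai)
    return {
        "synergy": combined > best_alone,   # beats the better of human or AI alone
        "augmentation": combined > human,   # beats humans alone
    }


# Hypothetical accuracies: the AI alone is strongest, the team lands in between
result = compare_to_baselines(human=0.70, ai=0.85, combined=0.80)
# result == {"synergy": False, "augmentation": True}
```

This pattern, where the combination improves on humans but falls short of the stronger solo performer, is consistent with a negative synergy effect alongside a positive augmentation effect.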
The results also highlighted substantial heterogeneity, with task type impacting human–AI synergy. In particular, decision tasks generally led to performance declines in human–AI combinations (g = −0.27), while creation tasks showed potential gains in performance (g = 0.19), although this effect was not statistically significant. Moderator analysis revealed that factors such as task type, data type, and the relative performance of humans and AI influenced synergy and augmentation levels.
Notably, combined human–AI systems showed performance gains when humans alone outperformed the AI alone, but showed performance losses when the AI alone was the stronger performer. Other significant moderating factors included AI type, publication year, and experimental design, with no significant effects found for participant type, confidence levels, or whether the AI provided explanations during tasks.
Analysis and Insights
This study examined the effectiveness of human–AI collaboration, evaluating both performance gains and losses over three years of research. Findings revealed that while human–AI systems often yielded better results than humans working alone, they generally failed to outperform the better of humans or AI working alone.
This was largely due to challenges in balancing reliance on AI systems, where users may over-rely on AI or, conversely, mistrust it. Interestingly, performance varied with task type. In decision tasks, where humans typically made the final choice, AI integration could lead to performance declines, while in creative tasks, AI augmentation generally enhanced results by aiding with routine aspects.
Additional analysis showed that task-specific designs and clear divisions of labor between humans and AI could improve outcomes. Notably, when humans outperformed AI, collaboration yielded performance gains, underscoring the importance of matching task strengths to either humans or AI. Other moderating factors, such as task accuracy and participant experience, seemed less impactful, suggesting a shift in focus toward developing standardized benchmarks and robust performance metrics.
Future recommendations included prioritizing human-AI processes for creative tasks, establishing standardized performance metrics, and building a shared repository for collaboration research. These steps aim to refine human-AI integration, optimize it for diverse real-world applications, and foster greater synergy.
Conclusion
In conclusion, while human–AI collaborations offered promise, they often failed to outperform the best solo performer, especially in decision-making tasks. However, they showed potential in creative tasks where AI assisted with repetitive elements.
This study underscored the importance of aligning task types with human or AI strengths and suggested that a structured approach, including defined metrics, shared research resources, and targeted process designs, could enhance collaboration outcomes. Future research should focus on optimizing human–AI design for specific contexts to maximize their combined strengths in real-world applications.