In the paper published in the journal Scientific Reports, researchers assessed Cigna's stress management toolkit, which included an artificial intelligence (AI)-based tool known as the Cigna StressWaves Test (CSWT). The study aimed to scrutinize, through independent validation, the claim that the CSWT is a 'clinical grade' assessment. Findings revealed that the CSWT lacked repeatability and exhibited poor convergent validity. The tool's public availability without adequate validation data raised concerns about the premature deployment of digital health tools for stress and anxiety management.
Background
The global impact of psychological stress on health, ranging from cardiovascular issues to depression, has long been recognized. Traditional methods for monitoring stress, such as the Perceived Stress Scale (PSS), rely on patient-reported questionnaires with well-established reliability and validity.
The PSS has been widely used in stress measurement and as a benchmark in studying various stress indicators, such as cortisol concentration, and evaluating stress management techniques. Recently, the emergence of AI-based digital tools for stress, depression, and anxiety assessment, like the CSWT, has gained attention. However, despite its wide availability, the CSWT lacks published validation data, a critical gap considering its integration into stress management strategies by a global health services company.
Methodology Overview: CSWT Evaluation
This study involved 60 participants aged 18 or above, recruited from Arizona State University. Before the experiment, researchers obtained institutional approval (IRB #00016588) and collected informed consent.
Inclusion criteria were English-speaking individuals aged 18 or older. The experiment employed standardized equipment (a Logitech H390 Wired Headset connected to a Dell computer) in a quiet laboratory setting. Participants were not made aware of their CSWT stress scores during the study.
CSWT: The CSWT, positioned as a clinical-grade tool for stress assessment based on speech analysis, prompted participants to select a question and respond for at least 60 seconds. Each participant underwent the test twice consecutively, choosing from eight prompts per session. Only one participant selected the same prompt for both sessions. The CSWT provides both ordinal and gradient scale outputs for stress levels. Participants also completed the 10-question PSS, which was scored numerically and on a three-level ordinal scale. The researchers randomized the administration order of the CSWT and PSS across participants.
Statistical Analysis: The primary analysis focused on test–retest reliability, assessed via the intra-class correlation (ICC) between the two CSWT administrations. Secondary analysis evaluated the CSWT's validity against the PSS by measuring correlations between PSS scores and the average CSWT scores from both administrations. Cohen's kappa quantified the repeatability of the ordinal ratings and their agreement with the PSS. Statistical analyses were performed in RStudio using the irr package.
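As a rough illustration of how these statistics are typically computed with the irr package, the R sketch below runs the same kinds of analyses on simulated placeholder data; the variable names and values are hypothetical and do not come from the study.

# Hypothetical example: test-retest reliability and ordinal agreement with the irr package
library(irr)

# Placeholder data: two CSWT administrations per participant (simulated, not the study's data)
cswt <- data.frame(test1 = c(55, 62, 48, 70, 66, 59),
                   test2 = c(60, 58, 52, 65, 71, 57))

# Intra-class correlation for test-retest reliability
icc(cswt, model = "twoway", type = "agreement", unit = "single")

# Cohen's kappa for agreement between two ordinal ratings (e.g., low/moderate/high)
ratings <- data.frame(test1 = c("low", "high", "moderate", "low", "high", "moderate"),
                      test2 = c("low", "moderate", "moderate", "low", "high", "high"))
kappa2(ratings)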
Power Analysis: Sample size estimation targeted the primary test–retest reliability analysis, assuming an expected ICC of 0.75; the researchers set the lower threshold for an acceptable ICC at 0.5 because of the inherent variability in speech-related acoustic features. A sample of 55 participants, plus an additional 5 to cover potential data issues, also provided 80% power to detect correlations of at least 0.33 between the CSWT and the PSS at a significance level of 0.05.
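As a back-of-the-envelope check rather than a reproduction of the authors' calculation, a correlation power analysis of this kind can be sketched with the pwr package; the one-sided test below is an assumption, chosen because it yields a required sample size in the mid-fifties.

# Hypothetical power calculation (assumed settings, not the authors' actual script)
library(pwr)

# Sample size for 80% power to detect r >= 0.33 at alpha = 0.05, one-sided
pwr.r.test(r = 0.33, sig.level = 0.05, power = 0.80, alternative = "greater")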
CSWT Evaluation: Reliability and Validity
In this study, 60 participants (36 females, 24 males) completed the CSWT twice during a single session to examine its reliability, and the PSS once to assess validity. The test–retest analysis revealed that the CSWT lacked repeatability, with a non-significant intra-class correlation (ICC = −0.106, p > 0.05).
Similarly, the assessment of convergent validity between the CSWT and the PSS demonstrated a lack of significant correlation (r = 0.200, p > 0.05). Multiple linear regression, utilizing both CSWT administrations to predict the PSS, only accounted for 6.9% of the variance in the PSS, further underscoring the CSWT's poor validity relative to the PSS.
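For readers unfamiliar with how a correlation or "variance explained" figure of this kind is obtained, the R sketch below shows the general pattern using simulated placeholder data and column names (pss, cswt1, cswt2); it is not the authors' code or data.

# Hypothetical illustration of the secondary validity analyses (simulated placeholder data)
set.seed(1)
df <- data.frame(pss   = rnorm(60, mean = 16, sd = 6),
                 cswt1 = runif(60, 0, 100),
                 cswt2 = runif(60, 0, 100))

# Pearson correlation between PSS and the average of the two CSWT administrations
cor.test(df$pss, (df$cswt1 + df$cswt2) / 2)

# Multiple linear regression predicting PSS from both CSWT administrations;
# the R-squared value is the proportion of PSS variance accounted for
fit <- lm(pss ~ cswt1 + cswt2, data = df)
summary(fit)$r.squared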
These findings challenge the claims of the CSWT being a clinically robust tool and question its effectiveness. The poor reliability and validity suggest limited agreement between the CSWT and the established PSS, raising concerns about its utility, especially considering its integration into broader stress management offerings. The extensive availability of such tools through prominent platforms might lead users to rely on them for critical health decisions, potentially resulting in misleading assessments, inappropriate treatments, and unwarranted anxiety or reassurance.
Moreover, beyond its limitations in reliability and validity, the CSWT's interpretation of psychological stress levels, particularly in extrapolating trait psychological stress from brief speech samples, raises feasibility concerns. This study serves as a cautionary example of deploying AI-driven tools without robust validation data, urging the need for stringent verification processes akin to those in healthcare to ensure the credibility of digital health tools, particularly in mental health assessment.
Additionally, the challenges associated with developing speech-based health measures, as evidenced by the variability in speech production and model transparency, contribute to the limitations of tools like the CSWT. The inherent variability in human speech production poses constraints on accurately predicting complex health states like psychological stress directly from speech, suggesting a need for cautious interpretation and verification of claims made by AI-driven health tools based on speech analysis.
Conclusion
To sum up, evaluating the CSWT against the PSS revealed substantial shortcomings in reliability and validity. These findings raise significant concerns regarding the CSWT's claim of clinical-grade performance and its effectiveness as a reliable stress assessment tool. The study emphasizes the critical need for stringent validation processes for AI-driven health tools, especially in mental health assessment.
Additionally, the challenges associated with speech-based health measures highlight the necessity for transparent validation and cautious interpretation of claims made by such tools. The study underscores the importance of robust verification and transparent reporting in ensuring the reliability and accuracy of digital health tools in clinical settings, particularly for mental health assessment.