A new AI model can predict when your Zoom meeting is about to go off the rails—spotting awkward silences and clunky conversations before they happen, so future virtual meetings can be smoother, more natural, and even enjoyable.
Since the onset of the COVID-19 pandemic, workers have spent countless hours in videoconferences, now a fixture of office life. As more people work and live remotely, videoconferencing platforms such as Zoom, MS Teams, FaceTime, Slack, and Discord are also a huge part of socializing among family and friends. Some exchanges are more enjoyable and flow better than others, raising questions about how the medium of online meetings could be improved to increase both efficiency and job satisfaction.
A team of New York University scientists has developed an AI model that identifies aspects of human behavior in videoconferences, such as conversational turn-taking and facial actions, and uses these behaviors to predict, in real time, whether a meeting is experienced as enjoyable, fluid, and comfortable or as awkward and marked by stilted turn-taking.
"Our machine learning model reveals the intricate dynamics of high-level social interaction by decoding subtle patterns within basic audio and video signals from videoconferences," says Andrew Chang, a postdoctoral fellow in NYU's Department of Psychology and the lead author of the paper, which appears in the conference publication IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). "This breakthrough represents an important step toward dynamically enhancing videoconference experiences by showing how to avoid conversational derailments before they occur."
To develop the model, the researchers trained it on more than 100 person-hours of Zoom recordings, using voice, facial expressions, and body movements as input. From these signals, the model learned to distinguish disruptive moments, when conversations became unfluid or unenjoyable, from smoother, more fluid exchanges.
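The paper's training code is not reproduced here, but the general recipe it describes, summarizing short windows of multimodal signals into features and classifying each window as fluid or unfluid, can be sketched in a few lines. Everything in the sketch below is an illustrative assumption: the feature set, the ten-second window, and the logistic-regression classifier merely stand in for whatever the authors actually used.

```python
# Minimal sketch of a per-window "fluid vs. unfluid" classifier.
# The features, window size, and classifier are illustrative
# assumptions, not the authors' implementation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def window_features(audio_energy, face_motion, body_motion):
    """Summarize one window of per-frame signals into a feature vector."""
    silence = audio_energy < 0.1  # frames with near-silent audio
    return np.array([
        silence.mean(),                                  # fraction of silent frames
        np.diff(silence.astype(int)).clip(min=0).sum(),  # number of silence onsets
        audio_energy.std(),                              # vocal dynamics
        face_motion.mean(),                              # facial activity
        body_motion.mean(),                              # body activity
    ])

# Stand-in for real extracted signals: 1,000 ten-second windows at
# 25 fps, with synthetic labels (1 = unfluid) that make unfluid
# windows quieter on average, mimicking long conversational gaps.
n_windows, n_frames = 1000, 250
y = rng.integers(0, 2, n_windows)
X = np.stack([
    window_features(
        rng.random(n_frames) * (0.6 if label else 1.0),
        rng.random(n_frames),
        rng.random(n_frames),
    )
    for label in y
])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
print(f"held-out AUC: {roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]):.2f}")
```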
Notably, the model rated conversations with unusually long gaps in turn-taking as less fluid and enjoyable than those in which participants spoke over one another. Put another way, "awkward silences" proved more detrimental than the chaotic, enthusiastic dynamics of a heated debate.
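As a rough illustration of the turn-taking signal involved, gaps and overlaps can be measured directly from per-speaker voice-activity timelines. The speech intervals and the one-second "awkward silence" threshold below are invented for illustration, not taken from the paper.

```python
# Measure gaps and overlaps between consecutive conversational turns.
# Intervals and the 1.0 s threshold are invented for illustration.
turns = [  # (speaker, start_s, end_s), sorted by start time
    ("A", 0.0, 4.2), ("B", 4.4, 9.1), ("A", 8.8, 12.0), ("B", 13.6, 15.0),
]

gaps, overlaps = [], []
for (_, _, prev_end), (_, start, _) in zip(turns, turns[1:]):
    delta = round(start - prev_end, 2)  # positive = gap, negative = overlap
    (gaps if delta > 0 else overlaps).append(abs(delta))

awkward = [g for g in gaps if g > 1.0]  # unusually long silences
print(f"gaps={gaps} overlaps={overlaps} awkward silences={awkward}")
```

On this toy timeline the exchange contains two gaps (0.2 s and 1.6 s) and one 0.3 s overlap; only the 1.6 s gap crosses the illustrative "awkward" threshold.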
To validate the model's assessments, an independent panel of more than 300 human judges viewed samples of the same videoconference footage, rating the fluidity of each conversation and how much they thought the participants enjoyed the exchange. Overall, these human ratings closely matched the model's assessments.
"Videoconferencing is now a prominent feature in our lives, so understanding and addressing its negative moments is vital for not only fostering better interpersonal communication and connection, but also for improving meeting efficiency and employee job satisfaction," says Dustin Freeman, a visiting scholar in NYU's Department of Psychology and the senior author of the paper. "By predicting moments of conversational breakdown, this work can pave the way for videoconferencing systems to mitigate these breakdowns and smooth the flow of conversations by either implicitly manipulating signal delays to accommodate or explicitly providing cues to users, which we are currently experimenting with."
The paper's other authors were Viswadruth Akkaraju and Ray McFadden Cogliano, graduate students at NYU's Tandon School of Engineering at the time of the research, and David Poeppel, a professor in NYU's Department of Psychology and at the Max Planck Society in Munich, Germany.
The research was supported, in part, by grants from the NYU Discovery Research Fund for Human Health, the National Institute on Deafness and Other Communication Disorders, part of the National Institutes of Health (F32DC018205), and Leon Levy Scholarships in Neuroscience.
Journal reference:
- A. Chang, V. Akkaraju, R. M. Cogliano, D. Poeppel and D. Freeman, "Multimodal Machine Learning Can Predict Videoconference Fluidity and Enjoyment," ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 2025, pp. 1-5, doi: 10.1109/ICASSP49660.2025.10889480, https://ieeexplore.ieee.org/document/10889480