In an article published in the journal Scientific Data, researchers from the University of Glasgow, University College London, and the University of Dundee proposed a novel multimodal dataset called RVTALL for contactless speech recognition and analysis using radio frequency, visual, text, audio, laser, and lip landmark information. They showed that the dataset offers a unique resource for exploring the potential of non-invasive techniques to capture and process speech-related information from diverse data sources.
Background
Speech recognition is the process of converting human speech into text or commands that machines can understand. It has many applications, such as voice assistants, dictation, transcription, translation, and authentication. However, it faces challenges in certain settings, such as silent speech recognition (SSR) for people with speech disorders, and multi-speaker environments in which a microphone captures sound from several sources without distinguishing the speakers' identities.
Most speech recognition systems rely on acoustic information from microphones, which may not be sufficient or reliable in some scenarios. This has led scientists to explore other modalities, such as visual, radar, and laser sensing, to capture and analyze speech-related information from the physiological processes that produce sound, such as lip movement, vocal cord vibration, and head movement. These modalities can provide complementary or alternative information to acoustic signals and enable contactless, non-invasive speech recognition and analysis.
About the Research
In the present study, the authors introduced a novel multimodal dataset that incorporates multiple modalities for SSR and speech enhancement, including ultra-wideband (UWB) radars, millimeter wave (mmWave) radar, and depth camera data. The dataset consists of the following data sources (a structural sketch of one sample follows the list):
- 7.5 GHz Channel Impulse Response (CIR) data from UWB radars, which can capture the lip movement and mouth shape of the speaker.
- 77 GHz frequency-modulated continuous-wave (FMCW) data from the mmWave radar, which can capture the speaker's vocal cord vibration and lip movement.
- Visual and audio information from a Kinect V2 camera, which can record the facial expression, lip movement, and voice of the speaker.
- Lip landmarks and laser data from a depth camera and a laser speckle detector, which can measure the shape and vibration of the speaker's lips and skin.
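For illustration only, a single synchronized recording could be represented along the following lines; the field names and array shapes here are assumptions made for the sketch, not the dataset's published file layout.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SpeechSample:
    """One synchronized, annotated recording (hypothetical layout)."""
    uwb_cir: np.ndarray        # UWB channel impulse response, e.g. (frames, range_bins)
    mmwave_fmcw: np.ndarray    # mmWave FMCW frames, e.g. (chirps, samples_per_chirp)
    video: np.ndarray          # Kinect RGB frames, e.g. (frames, height, width, 3)
    audio: np.ndarray          # microphone waveform, e.g. (n_samples,)
    lip_landmarks: np.ndarray  # lip landmark tracks, e.g. (frames, n_points, 2)
    laser_speckle: np.ndarray  # laser speckle vibration signal, e.g. (n_samples,)
    label: str                 # the spoken vowel, word, or sentence
    speaker_id: int            # anonymized participant identifier
```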
The dataset contains 400 minutes of annotated speech profiles collected from 20 volunteers with diverse backgrounds, speaking 15 words, 5 vowels, and 16 sentences. It includes data from different scenarios, such as single-person and dual-person speech, recorded at different distances from the sensors. The data collection was controlled by a multi-threaded script that synchronized recording across the different sensors. The participants come from different regions, including Europe, China, and Pakistan, which adds diversity and complexity to the dataset.
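The article does not reproduce the acquisition code, but a synchronized multi-sensor start can be sketched with standard Python threading, where each `capture_fn` stands in for a hypothetical per-sensor recording routine:

```python
import threading
import time

def record_sensor(name, capture_fn, start_barrier, duration_s, results):
    """Wait until every sensor thread is ready, then record for duration_s seconds."""
    start_barrier.wait()                        # all threads released together
    t0 = time.time()
    results[name] = capture_fn(duration_s)      # hypothetical per-sensor capture call
    print(f"{name}: started at {t0:.3f}, recorded {duration_s:.1f} s")

def synchronized_capture(sensors, duration_s=3.0):
    """sensors maps a sensor name to its capture function; returns per-sensor data."""
    barrier = threading.Barrier(len(sensors))
    results, threads = {}, []
    for name, capture_fn in sensors.items():
        t = threading.Thread(target=record_sensor,
                             args=(name, capture_fn, barrier, duration_s, results))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    return results
```

A barrier that releases all recording threads at once is one simple way to align the start of each stream; timestamping each sample and resampling afterwards is a common complementary step.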
Research Findings
The authors analyzed the signals from the different sensors and showed that they capture speech information from complementary aspects. They discussed examples of how the signals can be used for speech-related tasks, such as vowel and word classification, speaker identification, speech enhancement, and lip reconstruction.
The paper evaluated the performance of the different sensors and modalities on speech recognition tasks using a convolutional neural network (CNN)-based ResNet model. The results showed that the audio and video modalities achieved the highest accuracy, followed by the UWB and mmWave radar modalities. The laser speckle modality achieved the lowest accuracy, although it can be improved by combining it with other modalities. The research also showed that a sensor fusion scheme can improve speech recognition performance by leveraging complementary information from different sensors.
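To make the evaluation concrete, the sketch below shows one way a small ResNet-style CNN could classify the dataset's 15 words from two modalities and fuse them by concatenating per-modality features. The layer sizes, input shapes, and fusion strategy are assumptions for illustration, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic ResNet-style block: two convolutions with a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1, self.bn2 = nn.BatchNorm2d(channels), nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)                    # skip connection

class ModalityEncoder(nn.Module):
    """Encode one modality (e.g. a radar or audio spectrogram) into a feature vector."""
    def __init__(self, in_channels=1, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, feat_dim, 3, stride=2, padding=1),
            nn.BatchNorm2d(feat_dim), nn.ReLU(),
            ResidualBlock(feat_dim),
            ResidualBlock(feat_dim),
            nn.AdaptiveAvgPool2d(1),                 # global average pooling
            nn.Flatten(),
        )

    def forward(self, x):
        return self.net(x)

class FusionClassifier(nn.Module):
    """Late fusion: concatenate per-modality features, then classify."""
    def __init__(self, n_classes=15, feat_dim=64):   # 15 words in the dataset
        super().__init__()
        self.radar_enc = ModalityEncoder(feat_dim=feat_dim)   # e.g. mmWave spectrograms
        self.audio_enc = ModalityEncoder(feat_dim=feat_dim)   # e.g. audio spectrograms
        self.head = nn.Linear(2 * feat_dim, n_classes)

    def forward(self, radar, audio):
        feats = torch.cat([self.radar_enc(radar), self.audio_enc(audio)], dim=1)
        return self.head(feats)

# Dummy forward pass on a batch of 8 single-channel 64x64 "spectrograms" (shapes assumed)
model = FusionClassifier()
logits = model(torch.randn(8, 1, 64, 64), torch.randn(8, 1, 64, 64))
print(logits.shape)  # torch.Size([8, 15])
```

Concatenating features from each encoder is the simplest fusion scheme; attention-based or score-level fusion are common alternatives when one modality is much noisier than the others.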
Applications
The dataset has potential applications for various fields, such as:
- SSR: It can be used to develop and test SSR systems that can assist patients with speech disorders or people who want to communicate silently, without relying on invasive or wearable sensors.
- Speech enhancement: Speech enhancement systems can be developed and tested to improve the quality and intelligibility of speech signals in noisy or multi-speaker environments.
- Speech analysis: The speech characteristics and patterns of different speakers, such as gender, age, accent, emotion, and health, can be studied through the physical features of speech production captured in the dataset.
- Speech synthesis: Speech synthesis systems can be developed and validated to generate realistic, natural-sounding speech from text or other modalities.
Limitations and Conclusion
The authors acknowledge the following limitations in their study:
- A small sample size that may not cover all variations in speech characteristics, such as speaking habits, accents, and intonation.
- A lack of gender-, age-, or health-specific data for studying their effects on speech signals.
- A limited range of speech content, covering only vowels, words, and sentences, which may not fully represent natural speech in real-world applications.
- A limited evaluation of speech recognition performance, without comparisons against state-of-the-art methods or robustness tests under varied environmental conditions.
In summary, the newly developed multimodal dataset is a robust and efficient resource for contactless lip reading and acoustic analysis. It is particularly valuable for speech recognition researchers because it captures the detailed physical movements of the head during human speech, including mouth gestures and vocal cord vibrations. As such, it provides a comprehensive foundation for advancing research in speech-related technologies.