In an article published in the journal Scientific Reports, researchers from East China Jiaotong University, China, and Jeonbuk National University, South Korea, developed an innovative methodology for classifying sound events into known and unknown categories. They showed that their method can detect unknown sound events absent from the training data, a challenging problem in realistic scenarios. The technique combines deep learning and self-supervised learning to achieve robust and accurate performance.
Background
Sound is a form of energy that travels through mediums such as air, water, or solids as vibrations. When an object vibrates, it creates pressure waves that propagate through the medium as sound. The human ear detects these waves and converts them into electrical signals that are sent to the brain for interpretation. The brain processes these signals, allowing individuals to perceive and comprehend the sounds of their surroundings.
Sound events are audio segments produced by sources such as music, human speech, running water, and animal calls. Sound event classification identifies which type of event a given audio clip contains. It has many applications in areas such as audio surveillance, smart homes, environmental monitoring, and multimedia retrieval.
Deep learning methods have shown great success in sound event classification by integrating feature extraction and decision-making into a single model. However, most existing methods assume a closed-set scenario, in which the training and test data share the same feature embedding space and label set. They therefore cannot handle unknown sound events that are absent from the training data but frequently occur in realistic settings. This motivates sound event classification methods capable of open-set recognition, where the test data may contain classes never seen during training.
About the Research
In the present paper, the authors designed an open-set sound event classification network consisting of an encoder and a decision head. The encoder receives a two-dimensional (2-D) logarithmic-Mel (log-Mel) spectrogram as input and extracts high-level features from the audio signal. The decision head estimates the probabilities of the known sound events and flags unknown sound events based on a confidence threshold.
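As a concrete illustration, below is a minimal PyTorch sketch of such a pipeline: a log-Mel front end, a convolutional encoder, and a decision head with threshold-based rejection. The layer sizes, spectrogram parameters, and threshold value are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio

# Log-Mel spectrogram front end; sample rate, FFT size, and Mel-bin
# count are illustrative choices, not values from the paper.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=1024, hop_length=512, n_mels=64
)

def log_mel(waveform):
    # Small constant avoids log(0) on silent frames.
    return torch.log(mel(waveform) + 1e-6)

class OpenSetClassifier(nn.Module):
    """Encoder + decision head (a sketch, not the authors' exact network)."""
    def __init__(self, num_known_classes, embed_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        self.head = nn.Linear(embed_dim, num_known_classes)

    def forward(self, x):
        z = self.encoder(x)        # high-level feature embedding
        return z, self.head(z)     # embedding and known-class logits

def predict_open_set(model, x, threshold=0.5):
    """Assign a known class, or reject as unknown below the threshold."""
    _, logits = model(x)
    probs = F.softmax(logits, dim=-1)
    conf, pred = probs.max(dim=-1)
    pred[conf < threshold] = -1    # -1 denotes an unknown sound event
    return pred
```

At test time, any clip whose maximum class probability falls below the threshold is rejected as an unknown event; in practice the threshold would be tuned on a validation set.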
The key concept of the proposed approach is to form compact clusters in the feature space for the known classes. Compact known-class clusters leave most of the embedded feature space free, making it easier to place and detect unknown samples. To achieve this, the study applied two losses to optimize the model: a center loss and a supervised contrastive loss. The center loss reduces intra-class distance by pulling embedded features toward their cluster center, whereas the supervised contrastive loss encourages separation between features of different classes.
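The two losses can be sketched as follows. This is a hedged illustration of their standard formulations (a learnable per-class center for the center loss, and a Khosla-et-al.-style supervised contrastive term); the paper's exact hyperparameters and weighting may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CenterLoss(nn.Module):
    """Pulls each embedding toward a learnable center for its class,
    shrinking intra-class distance."""
    def __init__(self, num_classes, embed_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, embed_dim))

    def forward(self, z, labels):
        # Mean squared distance between embeddings and their class centers.
        return ((z - self.centers[labels]) ** 2).sum(dim=1).mean()

def supervised_contrastive_loss(z, labels, temperature=0.1):
    """Supervised contrastive loss: same-class pairs attract,
    different-class pairs repel in the embedding space."""
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / temperature                  # pairwise similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, float('-inf'))  # drop self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Average log-probability over each anchor's same-class positives.
    pos_counts = pos_mask.sum(dim=1)
    loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_counts.clamp(min=1)
    return loss[pos_counts > 0].mean()             # skip anchors with no positives
```

In training, these terms would be added to a standard cross-entropy objective with weighting coefficients, e.g. `loss = ce + lambda_c * center + lambda_s * supcon`, where the coefficients are hyperparameters not specified in this summary.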
Moreover, the research investigated the efficacy of self-supervised learning for detecting unknown sound events. Self-supervised learning leverages unlabeled data to learn useful representations without human supervision. A self-supervised contrastive loss is used to pre-train the network on the unlabeled MagnaTagATune/DCASE2019 Subtask 1C data, and the network is then fine-tuned on the downstream dataset.
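This summary does not specify the exact pretext setup, but a standard self-supervised contrastive objective of this kind is the NT-Xent (SimCLR-style) loss, in which two augmented views of the same unlabeled clip serve as a positive pair. A minimal sketch, assuming that setup:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent self-supervised contrastive loss. z1 and z2 are embeddings
    of two augmented views (e.g., time-shifted or masked spectrograms)
    of the same batch of unlabeled clips."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)
    sim = z @ z.t() / temperature                  # (2n, 2n) similarities
    self_mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))  # exclude self-similarity
    # The positive for view i is the corresponding other view (i+n or i-n).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```

After pre-training the encoder this way on the unlabeled clips, the decision head is attached and the whole network is fine-tuned on the labeled downstream data.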
Research Findings
The authors evaluated their method on various datasets, such as ESC-50, UrbanSound8K, and DCASE2019 Subtask 1C. They compared their technique with several baseline methods, such as the SoftMax classifier, OpenMax algorithm, and classification-reconstruction learning for open-set recognition (CROSR). Moreover, they conducted ablation studies to analyze the effects of different components of their method, such as center loss, contrastive loss, and self-supervised learning.
The outcomes showed that the newly designed methodology achieved significant improvements in both known and unknown sound event classification compared to the baselines. The method also outperformed state-of-the-art methods on the DCASE2019 Subtask 1C dataset. The study attributed this effectiveness to the compact cluster structure and to self-supervised learning, which enhanced the discriminability and generalization of the features.
Applications
The proposed approach to sound event recognition has potential applications in various domains, including smart homes, security systems, and healthcare. For example, it can monitor the sound environment in a smart home and alert the user to any abnormal or unknown sound event. In security, it can strengthen building protection by detecting intruders or potential threats through sound. In healthcare, it can analyze sound events associated with patients or elderly individuals to support diagnosis or assistance. It could also help curate personalized playlists or recommendations based on user preferences or moods.
Conclusion
The authors acknowledged challenges and limitations in their work and suggested the following directions for future research:
- Exploring other types of self-supervised learning methods, such as contrastive predictive coding, masked reconstruction, or rotation prediction, to further improve the feature learning ability of the network
- Incorporating other types of information, such as spatial, temporal, or semantic information, to enhance sound event classification and unknown-detection performance
- Applying the proposed method to other domains or modalities, such as image, video, or text, which also face the open-set recognition problem
In summary, the presented method is robust, effective, and efficient at classifying known sound events and detecting unknown ones. It achieved superior performance in both known and unknown sound event classification compared to existing methods. Moreover, it benefited from self-supervised learning, which leveraged unlabeled data to capture rich, representative features of sound events.