In a recent paper published in the journal Scientific Reports, researchers introduced a novel self-supervised deep-learning framework for underwater acoustic target recognition under the constraints of abundant unlabeled samples and sparse labeled data.
Study: Advancing Deep Learning Models for Underwater Acoustic Target Recognition.
Background
In recent years, there has been a concerted effort to develop end-to-end deep neural networks for the automatic extraction of intricate features, facilitating the identification of underwater acoustic targets. However, the scarcity of extensive, high-quality labeled datasets due to the prohibitive cost of annotations and inevitable annotation errors remains a significant challenge in the field of underwater acoustic target recognition.
The challenge of limited labeled samples is a common hurdle for deep learning models. In the current study, researchers employed the few-shot learning (FSL) technique, which addresses the recognition of categories with few samples by leveraging classes with abundant labeled data, thereby building class-independent capabilities into the model. Although real-world underwater acoustic data are plentiful, the prevailing scenario involves identifying targets from numerous unlabeled samples and only a small number of labeled ones.
A Multi-Stage Learning Framework
Several research efforts in underwater acoustic target recognition have applied self-supervised learning to address the problem of sparse labeled data. However, two gaps remain: the lack of rigorous performance benchmarks for varying numbers of labeled samples, and the limited research on potential uses of unlabeled data. Furthermore, although self-supervised learning techniques rely less on labeled data, they may still struggle when fine-tuning with very few labeled samples, especially in the few-shot learning regime.
The presented learning framework builds on SimCLRv2 and amalgamates advanced self-supervised learning techniques: a simple framework for contrastive learning of visual representations (SimCLR) and Bootstrap Your Own Latent (BYOL). It unveils a four-stage process for training underwater acoustic target recognition models.
In the initial self-supervised learning phase, the model is primed with massive unlabeled samples via effective contrastive learning. A fully connected (FC) layer is then added to the model in the supervised fine-tuning stage, which uses a limited number of labeled samples to aid with classification. However, when confronted with an inadequate quantity of labeled samples, such as one or five per class, the supervised fine-tuning may encounter performance challenges.
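As a rough illustration of the supervised fine-tuning stage, the sketch below appends a single fully connected classification head to frozen pre-trained features and scores it with cross-entropy. The function names, feature dimensions, and the linear-probe setup are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def fc_head(features, weights, bias):
    """A single fully connected classification layer appended to
    features from the (frozen) pre-trained encoder."""
    return features @ weights + bias

def cross_entropy(logits, labels):
    """Mean cross-entropy over a batch, computed from raw logits."""
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

# Tiny example: 3 pre-trained feature vectors, 4 ship classes.
feats = np.random.default_rng(0).standard_normal((3, 8))
W, b = np.zeros((8, 4)), np.zeros(4)
logits = fc_head(feats, W, b)
# With an untrained (zero) head, the loss equals log(num_classes).
assert abs(cross_entropy(logits, np.array([0, 1, 2])) - np.log(4)) < 1e-9
```

In practice the head and (optionally) the encoder would be trained by gradient descent on the limited labeled set; the zero-weight head here only demonstrates the loss computation.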
In such scenarios, a semi-supervised fine-tuning step intervenes, aiming to boost performance by identifying and labeling select unlabeled samples based on deep feature similarity. Finally, an unsupervised self-distillation phase is enacted: the fine-tuned model serves as a teacher, guiding a newly initialized student model trained on extensive unlabeled data.
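The distillation objective is not spelled out in this summary, but a common choice for teacher-student self-distillation is the KL divergence between temperature-softened teacher and student outputs. The sketch below illustrates that standard formulation; the temperature value and exact loss form are assumptions.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Batch-mean KL divergence between softened teacher and student
    distributions, a common self-distillation objective."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    return float(kl.mean())

# Matching logits give (near-)zero loss; diverging logits increase it.
t = np.array([[2.0, 0.5, -1.0, 0.0]])
assert distillation_loss(t, t) < 1e-9
assert distillation_loss(np.array([[0.0, 2.0, -1.0, 0.5]]), t) > 0.0
```

Because the KL loss needs only teacher outputs, not ground-truth labels, the student can be trained on the extensive unlabeled data mentioned above.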
The proposed framework stands out in two ways. First, it encompasses four key stages, including semi-supervised fine-tuning, which caters to minimal labeled samples (1-shot, 5-shot, or 20-shot). Second, it employs BYOL during the self-supervised learning stage to pre-train models, leveraging unlabeled data more effectively.
While BYOL is directly applied to single-branch models, adjustments are made for joint models with wave and T-F branches. The process includes independent projector and predictor networks for each branch, the computation of contrastive loss, and the utilization of deep features for feature extraction. The subsequent supervised fine-tuning stage appends an FC layer to the pre-trained model, employing limited labeled samples for training. In addition, synchronous deep mutual learning is introduced for joint models, enhancing their performance.
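To make the BYOL-style objective concrete, the minimal sketch below computes the negative-cosine-similarity loss between the online network's prediction and the target network's projection, together with the exponential-moving-average target update from the original BYOL formulation. The per-branch handling for the wave and T-F branches is omitted, and all names here are illustrative.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def byol_loss(online_prediction, target_projection):
    """BYOL's loss: 2 - 2 * cosine similarity between the online
    predictor's output and the (stop-gradient) target projection."""
    p = l2_normalize(online_prediction)
    z = l2_normalize(target_projection)
    return float((2.0 - 2.0 * np.sum(p * z, axis=-1)).mean())

def ema_update(target_params, online_params, tau=0.99):
    """Exponential moving average update of the target network's weights,
    so the target slowly tracks the online network."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]

# Perfectly aligned representations give zero loss; orthogonal ones give 2.
v = np.array([[1.0, 0.0]])
assert abs(byol_loss(v, v)) < 1e-9
assert abs(byol_loss(v, np.array([[0.0, 1.0]])) - 2.0) < 1e-9
```

For a joint model, one would presumably compute this loss per branch (wave and T-F) through each branch's own projector and predictor, then combine the branch losses.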
To address label noise and assign class information to unlabeled samples, a joint method is introduced that leverages deep feature similarity and a consistent matching module.
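The exact consistent matching module is not detailed in this summary. As one plausible sketch of pseudo-labeling by deep feature similarity, the code below assigns an unlabeled sample to a class only when its feature is sufficiently close to that class's prototype (the mean feature of its labeled samples); the cosine metric, threshold, and prototype construction are all assumptions.

```python
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def pseudo_label(unlabeled_feats, class_prototypes, threshold=0.8):
    """Assign a class to each unlabeled sample only when its deep feature
    is similar enough to one class prototype; return -1 (rejected) for
    samples that fail the confidence threshold, mitigating label noise."""
    sims = cosine_sim(unlabeled_feats, class_prototypes)  # shape (N, C)
    best = sims.argmax(axis=1)
    confident = sims.max(axis=1) >= threshold
    return np.where(confident, best, -1)

protos = np.array([[1.0, 0.0], [0.0, 1.0]])               # two class prototypes
feats = np.array([[0.95, 0.1], [0.7, 0.7], [0.05, 1.0]])  # the middle sample is ambiguous
print(pseudo_label(feats, protos))  # → [ 0 -1  1]
```

Rejecting low-confidence samples (the -1 entries) is one simple way to keep noisy pseudo-labels out of the semi-supervised fine-tuning set.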
Datasets and Evaluation
Datasets: The DeepShip dataset serves as the evaluation benchmark in the current study. This dataset comprises 47 hours and four minutes of underwater recordings capturing 265 different ships across four categories. The researchers created seven distinct training datasets: three with 10 percent, 5 percent, and 1 percent of the samples labeled; three with a very limited number of labeled samples (20-shot, 5-shot, and 1-shot); and one full training dataset. A single testing dataset is used, and it does not overlap with any of the training datasets.
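The k-shot training subsets described above can be sketched as a simple per-class sampling routine. The function below is an illustrative assumption about how such subsets might be drawn, not the authors' actual dataset-construction code.

```python
import numpy as np

def make_k_shot(labels, k, seed=0):
    """Pick k labeled sample indices per class, mimicking the
    1-shot / 5-shot / 20-shot training-set construction."""
    rng = np.random.default_rng(seed)
    chosen = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)      # all samples of class c
        chosen.extend(rng.choice(idx, size=k, replace=False))
    return np.sort(np.array(chosen))

labels = np.array([0, 0, 1, 1, 2, 2, 3, 3, 0, 1])  # four ship categories
subset = make_k_shot(labels, k=1)
assert len(subset) == 4  # exactly one labeled sample per class
```

The remaining, unselected samples would then serve as the unlabeled pool for self-supervised pre-training and semi-supervised fine-tuning.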
Data Augmentation and Evaluations: For the joint model, two data representations were used: waveform and time-frequency (T-F) representation. For the waveform, four data augmentation methods are employed: Pitch Shift, Speed Change, Random Gain, and Random Cropping. For the T-F representation, four data augmentation methods are applied: Random Cropping, Contrast Change, Brightness Change, and Gaussian Blur. Model performance in this study is evaluated through accuracy.
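Two of the waveform augmentations, Random Gain and Random Cropping, are easy to sketch directly in NumPy, as below; Pitch Shift and Speed Change typically require a resampling library and are omitted. The 16 kHz sample rate, gain range, and crop length are illustrative assumptions, not values reported in the study.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_gain(wave, low_db=-6.0, high_db=6.0):
    """Scale the waveform by a gain drawn uniformly in decibels."""
    gain_db = rng.uniform(low_db, high_db)
    return wave * 10.0 ** (gain_db / 20.0)

def random_crop(wave, crop_len):
    """Take a random contiguous segment of fixed length."""
    start = rng.integers(0, len(wave) - crop_len + 1)
    return wave[start:start + crop_len]

x = rng.standard_normal(16000)  # 1 s of audio at an assumed 16 kHz
# Two independently augmented views of the same signal, as contrastive
# self-supervised methods require.
view1 = random_crop(random_gain(x), 8000)
view2 = random_crop(random_gain(x), 8000)
assert view1.shape == view2.shape == (8000,)
```

Applying two independent augmentation chains to one recording yields the paired "views" that the BYOL-based pre-training stage contrasts.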
Results and Analysis
The experiments are organized into three distinct categories. Initially, empirical experiments assess the impact of data augmentation, establishing systematic performance benchmarks for four typical underwater acoustic target recognition models across seven unique training datasets. Secondly, comparative experiments evaluate model recognition performance under the proposed learning framework. Lastly, ablation experiments scrutinize the performance of the semi-supervised fine-tuning method.
In the empirical experiments, four common underwater acoustic target recognition models are trained on the seven datasets: MLENET, a classification model for underwater acoustic signals; lightweight MSRDN, a joint learning model for underwater acoustic regression; a separable convolution-based autoencoder (SCAE); and the joint model. Comparative experiments under the proposed learning framework demonstrate significant improvements in model accuracy under few-shot conditions (1-shot, 5-shot, and 20-shot). These enhancements, ranging from 0.46 percent to 9.13 percent, indicate the framework's effectiveness in addressing the challenges posed by limited labeled data.
Conclusion
In summary, the current study delves into underwater acoustic target recognition under the challenges of limited labeled samples and abundant unlabeled data. A four-stage learning framework is introduced. With this framework, achieving optimal joint model performance requires labeling only 10 percent of the training dataset.
Article Revisions
- Oct 24 2023 - Minor edits to improve readability and change to main image.