In an article recently published in the journal AI, the authors reviewed different types of few-shot fine-grained image classification (FSFGIC) methods, including global and/or local deep feature representation learning-based FSFGIC methods and class representation learning-based FSFGIC methods.
Background
FSFGIC methods are used to classify images belonging to different subcategories of the same superclass (for example, different species of birds) using only a small number of labeled samples. By learning more discriminative feature representations from limited sample information, FSFGIC methods significantly improve generalization ability and classification accuracy, attaining better outcomes in FSFGIC tasks.
Recently, FSFGIC has received significant attention, with different techniques proposed for FSFGIC tasks. Several few-shot learning methods have demonstrated impressive results while handling FSFGIC tasks. In this paper, the authors reviewed different FSFGIC methods, including class representation learning-based FSFGIC methods and global and/or local deep feature representation learning-based FSFGIC methods.
Class representation learning-based FSFGIC methods
Class representations can alleviate the overfitting phenomenon and represent a novel class effectively. Class representation learning-based methods are categorized as optimization-based class representation learning and metric-based class representation learning. For instance, an optimization-based FSFGIC method has been developed that includes a classifier mapping module and a bilinear feature learning module.
The classifier mapping module used a "piecewise mappings" function to map features to decision boundaries and encoded discriminative information. Similarly, an adaptive distribution calibration (ADC) method was proposed to address few-shot learning's distribution bias by adaptively calibrating and transferring information from base classes to enhance the classification performance of novel classes.
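The general idea behind distribution calibration can be illustrated with a minimal numpy sketch. This is a loose illustration of the calibration-and-transfer concept, not the paper's exact ADC method; the statistics, shapes, and blending rule below are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical base-class statistics, assumed to be estimated beforehand
# from many labeled base samples (names and shapes are illustrative).
base_means = rng.normal(size=(10, 8))              # 10 base classes, 8-dim features
base_covs = np.stack([np.eye(8) for _ in range(10)])

def calibrate(support_feature, k=2, alpha=0.2):
    """Calibrate a novel class's distribution by borrowing statistics
    from the k base classes nearest to its single support feature."""
    dists = np.linalg.norm(base_means - support_feature, axis=1)
    nearest = np.argsort(dists)[:k]
    # Calibrated mean: blend the support feature with nearby base means.
    mean = (support_feature + base_means[nearest].sum(axis=0)) / (k + 1)
    # Calibrated covariance: average nearby base covariances plus slack.
    cov = base_covs[nearest].mean(axis=0) + alpha * np.eye(8)
    return mean, cov

support = rng.normal(size=8)                       # one labeled novel-class sample
mean, cov = calibrate(support)
extra = rng.multivariate_normal(mean, cov, size=50)  # sampled virtual features
print(extra.shape)  # (50, 8)
```

The sampled virtual features can then augment the novel class's training set, which is the sense in which calibration counteracts the distribution bias of having only a handful of shots.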
A novel transformer-based neural network architecture, designated as CrossTransformers, has been proposed that applies a cross-attention mechanism to identify the coarse spatial correspondence between the labeled support samples and the query samples of a class. Moreover, an end-to-end graph-based approach, designated as an explicit class knowledge propagation network (ECKPN), has been designed to explicitly propagate and learn the class representations.
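The core cross-attention step can be sketched as follows. This is a simplified illustration of attending from query spatial locations to support spatial locations; the shapes, the dot-product scoring, and the distance-based class score are assumptions, not the CrossTransformers architecture itself.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, support_feats):
    """Attend from query spatial features to support spatial features,
    producing a query-aligned class representation."""
    # query_feats: (Nq, d) spatial locations of the query image
    # support_feats: (Ns, d) spatial locations from a class's support set
    scores = query_feats @ support_feats.T / np.sqrt(query_feats.shape[1])
    attn = softmax(scores, axis=1)      # spatial correspondence weights
    return attn @ support_feats         # (Nq, d) aligned representation

rng = np.random.default_rng(1)
q = rng.normal(size=(16, 8))    # 16 query locations, 8-dim features
s = rng.normal(size=(32, 8))    # 32 support locations for one class
aligned = cross_attention(q, s)
# Class score: negative distance between query and its aligned representation.
score = -np.square(q - aligned).sum()
print(aligned.shape)
```

Computing one such score per class and taking the maximum yields a nearest-class decision that respects spatial correspondence rather than comparing globally pooled features.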
A conditional feature generation model was developed by combining generative adversarial networks (GANs) and a variational autoencoder (VAE) to address the problem of mode collapse in GAN-based feature generators. This model can learn the conditional distribution of image features on labeled class data and their marginal distribution on unlabeled class data.
Global and/or local deep feature representation learning-based FSFGIC methods
In the FSFGIC field, local deep feature representations can identify the discriminative regions to distinguish subtle variances of fine-grained features. The combination of local and global deep feature representation learning can effectively enhance the deep feature representation capability.
Currently, metric- and optimization-based techniques use global and/or local deep feature representations to perform FSFGIC tasks. The current optimization-based methods for global and/or local deep feature representation learning primarily focus on fine-tuning techniques. They improve the performance of the model with limited training data by integrating the fine-tuning process into the meta-training stage.
A multi-attention meta-learning (MattML) method employed attention mechanisms in both the task learner and the base learner, using multiple attention mechanisms to capture the feature information of local and subtle parts of an image. Similarly, an evolutionary search strategy has been proposed to transfer partial knowledge by fine-tuning specific base-model layers after capturing the deep feature representations with the feature extractor. This evolutionary search approach can be embedded into either an optimization-based or a metric-based method to perform FSFGIC tasks. In addition, a more accurate and comprehensive representation of image feature information can be achieved by enhancement methods that integrate local and global perception features into the feature space and add semantic orthogonality constraints.
Metric-based global and/or local deep feature representation learning methods are classified into six categories: multi-scale representation, semantic alignment, feature distribution, multi-model learning, metric strategy, and attention mechanism. A self-attention-based prototype enhancement network (SAPENet) was proposed in a study to capture a more representative prototype for every class, while an automatic salient region selection network was proposed without using a part annotation or bounding box mechanism to locate salient regions from images.
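The prototype-enhancement idea can be sketched with a few lines of numpy: instead of averaging a class's support embeddings into a plain prototype, the shots first attend to one another so that more representative shots contribute more. This is a loose sketch of the concept, not SAPENet's actual architecture; all shapes and the attention form are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attentive_prototype(support):
    """Build a class prototype from (K, d) support embeddings, weighting
    shots by self-attention rather than a uniform mean."""
    scores = support @ support.T / np.sqrt(support.shape[1])
    attn = softmax(scores, axis=1)
    enhanced = attn @ support        # each shot attends to the others
    return enhanced.mean(axis=0)     # (d,) enhanced class prototype

def classify(query, prototypes):
    """Nearest-prototype rule, as in prototypical-network-style metrics."""
    d = np.linalg.norm(prototypes - query, axis=1)
    return int(np.argmin(d))

rng = np.random.default_rng(2)
# Toy episode: 3 classes, 5 shots each, 8-dim embeddings per class.
protos = np.stack([attentive_prototype(rng.normal(loc=c, size=(5, 8)))
                   for c in range(3)])
pred = classify(rng.normal(loc=2.0, size=8), protos)
print(pred)
```

A query is then assigned to whichever enhanced prototype it lies closest to, which is the common metric-based decision rule these methods share.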
A domain-specific FSFGIC task on marine organisms was proposed, together with a feature fusion model designed to focus on key regions. Specifically, the feature fusion model, as the key component, used high-order integration and focus-area location to create feature representations containing more identifiable information.
The DeepEMD method formalized image classification as an optimal matching problem between images. The earth mover's distance (EMD) was then used over local discriminative feature representations to find the optimal matching between support and query samples. The Sinkhorn distance was also utilized to identify an optimal matching between images, mitigating the object mismatch caused by misaligned positions.
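The Sinkhorn-style matching step can be sketched as entropy-regularized optimal transport between two sets of local features. This is a generic Sinkhorn iteration under assumed shapes and a cosine cost, not DeepEMD's exact pipeline.

```python
import numpy as np

def sinkhorn(cost, r, c, reg=0.1, n_iter=200):
    """Entropy-regularized optimal transport: returns a matching matrix T
    with row sums r and column sums c that approximately minimizes
    <T, cost> - reg * entropy(T)."""
    K = np.exp(-cost / reg)
    u = np.ones_like(r)
    for _ in range(n_iter):
        v = c / (K.T @ u)      # scale columns to hit marginals c
        u = r / (K @ v)        # scale rows to hit marginals r
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(3)
support = rng.normal(size=(6, 8))   # 6 local features of a support image
query = rng.normal(size=(6, 8))     # 6 local features of a query image
# Cost: (1 - cosine similarity) between every support/query location pair.
sn = support / np.linalg.norm(support, axis=1, keepdims=True)
qn = query / np.linalg.norm(query, axis=1, keepdims=True)
cost = 1.0 - sn @ qn.T
r = np.full(6, 1 / 6)
c = np.full(6, 1 / 6)
T = sinkhorn(cost, r, c)
emd_like = (T * cost).sum()   # image-to-image distance for classification
print(round(float(emd_like), 4))
```

The resulting transport cost serves as an image-to-image distance: query images are assigned to the class whose support samples they match most cheaply, which is how optimal matching tolerates misaligned object positions.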
Multi-scale representation improves the global feature representation because larger scales, with bigger receptive fields, contain richer information. For instance, a multi-scale second-order relation network (MsSoSN) equipped with a scale selector and second-order pooling was proposed for generating second-order multi-scale representations. A discrepancy-and-scale discriminator was also proposed to reweight the multi-scale features, trained using self-supervision.
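The multi-scale second-order pooling idea above can be sketched as follows: pool a feature map at several spatial scales, take the averaged outer product of the pooled descriptors at each scale (capturing feature co-occurrences rather than just means), and concatenate. This is a conceptual sketch under assumed shapes, not the MsSoSN network itself.

```python
import numpy as np

def second_order_pool(feats):
    """Second-order pooling: average outer product of (N, d) local
    descriptors, yielding a (d, d) covariance-like representation."""
    return (feats[:, :, None] * feats[:, None, :]).mean(axis=0)

def multi_scale_sop(feature_map, scales=(1, 2)):
    """Pool an (H, W, d) feature map into an s x s grid at each scale,
    second-order pool the grid cells, and concatenate the results.
    Scales are assumed to divide H and W evenly."""
    H, W, d = feature_map.shape
    reps = []
    for s in scales:
        cells = feature_map.reshape(s, H // s, s, W // s, d).mean(axis=(1, 3))
        reps.append(second_order_pool(cells.reshape(-1, d)).ravel())
    return np.concatenate(reps)

rng = np.random.default_rng(4)
fmap = rng.normal(size=(4, 4, 8))   # toy 4x4 spatial map, 8-dim features
rep = multi_scale_sop(fmap)
print(rep.shape)   # two scales, each an 8x8 matrix flattened
```

Coarser scales summarize larger receptive fields while finer scales keep local detail, which is why concatenating them enriches the global representation.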
To summarize, the existing FSFGIC methods have made significant progress in FSFGIC tasks. However, more research is required to address several important challenges to FSFGIC, including the trade-off between the image feature representation ability and the overfitting problem, generalization in FSFGIC, and issues related to efficiency and performance.
Journal reference:
- Ren, J., Li, C., An, Y., Zhang, W., Sun, C. (2024). Few-Shot Fine-Grained Image Classification: A Comprehensive Review. AI, 5(1), 405-425. https://doi.org/10.3390/ai5010020, https://www.mdpi.com/2673-2688/5/1/20