In a paper published in the journal Pattern Recognition, researchers introduced TransOSV, a novel holistic-part unified model for offline signature verification built on the vision transformer framework. The model encoded signature images as patch sequences using a transformer-based holistic encoder for global representation and employed a contrast-based part decoder with a sparsity loss to capture subtle local differences. This approach achieved state-of-the-art results on both writer-independent (WI) and writer-dependent (WD) signature verification tasks across several signature datasets.
Introduction
Offline signature verification aims to authenticate a questioned signature by comparing it with reference signatures: signature features are extracted, and a decision is made based on feature distances. Extracting robust and discriminative signature features is therefore the central challenge. Convolutional neural networks (CNNs) have improved verification performance by learning such features directly from signature images. However, CNN-based methods still face a key limitation.
Capturing relationships among signature strokes is essential for verification, yet CNNs rely on small receptive fields, which limits their ability to model global context and inter-stroke dependencies. Transformer networks, in contrast, excel at learning sequence features from a global context: they replace convolution and pooling with self-attention, preserving detailed information. Since verification requires evaluating both global and local structural similarity, transformers appear well suited to offline signature verification.
Related work
Signature verification has been studied extensively. Over time, CNN-based approaches became the dominant paradigm, gradually surpassing conventional two-stage methods thanks to their end-to-end design and strong feature extraction. Models incorporating multi-stream architectures and attention mechanisms brought dynamic interaction to offline signature verification, and region-based metric learning techniques were applied to both WI and WD scenarios. CNNs also proved useful in WD verification, and self-supervised methods were introduced to facilitate signature representation learning. However, CNN-based methods are inherently limited in their capacity to reason about global context, which can constrain the efficacy and generalizability of the extracted features.
Proposed method
In this approach, TransOSV is introduced as a novel signature verification model built on a vision transformer framework. TransOSV incorporates two weight-shared holistic encoders that capture holistic features, i.e., a global representation of the signature together with the relationships among its strokes. The holistic encoder also preserves all image patch features for subsequent modules: these patch features are reshaped into feature maps, from which a convolution module extracts convolution features that implicitly encode positional information. For local feature extraction, a novel contrast-based part decoder is developed to discern discriminative part features, taking the refined convolution features as input. In addition, a sparsity loss guides the decoder toward learning the most distinctive part features.
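To make this pipeline concrete, the sketch below shows one plausible arrangement of the components in PyTorch. The dimensions, the learnable part queries, the mean-pooled holistic feature, and the entropy-style sparsity penalty are all illustrative assumptions; the paper's exact architecture and loss formulations are not reproduced here.

```python
import torch
import torch.nn as nn

class TransOSVSketch(nn.Module):
    """Illustrative sketch of the TransOSV pipeline (dimensions assumed)."""
    def __init__(self, dim=384, n_heads=6, n_parts=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.holistic_encoder = nn.TransformerEncoder(layer, num_layers=8)
        # Convolution module: injects implicit positional information
        self.conv_module = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        # Learnable part queries attended over the convolution features
        self.part_tokens = nn.Parameter(torch.randn(1, n_parts, dim))
        self.part_decoder = nn.MultiheadAttention(dim, num_heads=n_heads, batch_first=True)

    def forward(self, patch_embeddings):
        # patch_embeddings: (B, N, dim) patch sequence of one signature image
        tokens = self.holistic_encoder(patch_embeddings)
        holistic = tokens.mean(dim=1)                         # global representation
        B, N, C = tokens.shape
        side = int(N ** 0.5)                                  # assumes a square patch grid
        fmap = tokens.transpose(1, 2).reshape(B, C, side, side)
        conv_feats = self.conv_module(fmap).flatten(2).transpose(1, 2)
        parts, attn = self.part_decoder(self.part_tokens.expand(B, -1, -1),
                                        conv_feats, conv_feats)
        return holistic, parts, attn

def sparsity_loss(attn):
    """Entropy penalty pushing each part token to attend to few patches
    (one plausible form of the sparsity loss)."""
    p = attn.clamp_min(1e-8)
    return -(p * p.log()).sum(dim=-1).mean()
```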
By leveraging the holistic features from the encoder and the discriminative part features from the decoder, the proposed TransOSV model effectively enhances the performance of WI signature verification. To address sample imbalance, a new focal contrast loss (FC), inspired by the focal loss, is formulated. This loss function adaptively mines challenging samples to promote the learning of more robust features during training. Furthermore, the proposed model's applicability is extended to learning signature representations for WD signature verification tasks.
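The paper's exact formulation of the focal contrast loss is not reproduced here; the sketch below illustrates the idea, assuming a standard margin-based contrastive loss modulated by a focal-style factor so that hard pairs dominate the gradient. The `margin` and `gamma` hyperparameters are hypothetical.

```python
import torch
import torch.nn.functional as F

def focal_contrast_loss(dist, label, margin=1.0, gamma=2.0):
    """Focal-style contrastive loss sketch.

    dist:  (B,) feature distances between signature pairs
    label: (B,) 1.0 for genuine pairs, 0.0 for forged pairs
    """
    # Standard contrastive terms: pull genuine pairs together,
    # push forged pairs beyond the margin.
    contrastive = label * dist.pow(2) + (1 - label) * F.relu(margin - dist).pow(2)
    # Focal-style modulation: estimate how "easy" each pair already is
    # and down-weight easy pairs, so hard samples drive training.
    p_easy = torch.where(label.bool(),
                         torch.sigmoid(margin - dist),   # genuine: easy when dist is small
                         torch.sigmoid(dist - margin))   # forged: easy when dist is large
    return ((1 - p_easy).pow(gamma) * contrastive).mean()
```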
Experimental results
In the experiments, the TransOSV model follows the ViT configuration. The encoder comprises eight transformer layers, with patches of size 16 extracted at a step size (stride) of 10, i.e., overlapping patches. Training runs on 8 NVIDIA 3090 GPUs with a batch size of 64 per GPU, using the PyTorch toolbox with FP16 (mixed-precision) training. Encoder weights are pre-trained on the ImageNet-21K dataset, and the Stochastic Gradient Descent (SGD) optimizer is used with a momentum of 0.9 and a weight decay of 1e-4.
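A minimal training-loop sketch matching this setup is given below, reusing the TransOSVSketch, focal_contrast_loss, and sparsity_loss sketches from earlier. The learning rate, the 0.1 loss weight, and the pair_loader yielding batches of signature pairs are hypothetical; only the optimizer settings and FP16 training come from the paper.

```python
import torch
from torch.cuda.amp import GradScaler, autocast

model = TransOSVSketch().cuda()                      # sketch model defined above
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.01,                 # hypothetical; not stated here
                            momentum=0.9,            # as reported
                            weight_decay=1e-4)       # as reported
scaler = GradScaler()                                # FP16 mixed-precision training

for patches_a, patches_b, labels in pair_loader:     # hypothetical loader of signature pairs
    optimizer.zero_grad()
    with autocast():
        h_a, _, attn_a = model(patches_a.cuda())
        h_b, _, attn_b = model(patches_b.cuda())
        dist = torch.norm(h_a - h_b, dim=1)          # holistic feature distance
        loss = (focal_contrast_loss(dist, labels.cuda())
                + 0.1 * (sparsity_loss(attn_a) + sparsity_loss(attn_b)))  # 0.1 is assumed
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```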
This approach is evaluated on four challenging signature datasets: BHSig-H, BHSig-B, GPDS Synthetic, and CS749. Performance is assessed using the standard metrics of False Rejection Rate (FRR), False Acceptance Rate (FAR), and Equal Error Rate (EER) for both WD and WI signature verification. Notably, TransOSV significantly outperforms existing methods. For WI verification on the BHSig-H dataset, the model achieves an EER of 3.24%, surpassing the previous state of the art by 1.26% and the original ViT by 4.65%. WD experiments on the CS749 dataset further confirm the superiority of TransOSV over a pre-trained model, showcasing its ability to learn offline signature representations effectively.
Furthermore, ablation studies confirm that the holistic encoder, the contrast-based part decoder, the convolution module, the focal contrast loss, and the sparsity loss each contribute to the model's performance. While pre-training on ImageNet greatly enhances the model's effectiveness, pre-training alone falls short of the full TransOSV model, underscoring the comprehensive strengths of this approach.
Contribution of this paper
The key contributions of this study can be summarized as follows:
- Establishment of a new benchmark for transformer-based offline signature verification through the novel TransOSV model.
- Utilization of a holistic encoder and a contrast-based part decoder for extracting both holistic and discriminative part features, which were validated through comprehensive ablation studies.
- Introduction of a novel FC function that synergizes with the proposed model, achieving state-of-the-art performance on various standard datasets for WI signature verification.
- Application of the proposed model to learning signature representations in WD signature verification tasks, demonstrating its generalization capacity.
Conclusion
To summarize, this study presents a novel transformer-based model for offline signature verification. It incorporates a holistic encoder and a contrast-based part decoder to capture both global and local features. Experimental evaluations on four signature datasets confirm the effectiveness of the proposed approach. Future work will extend the model's validation to broader signature verification tasks.