In a recent publication in the journal Electronics, researchers introduced a Siamese network built from a convolutional neural network (CNN) that improves the precision and robustness of visual object tracking.
Background
Visual object tracking (VOT) is pivotal in computer vision, with applications in factory automation, autonomous driving, surveillance, and drones. The task is to track a target across video frames, typically initialized with a bounding box around the target in the first frame. However, the information in that initial frame often proves insufficient for consistent tracking, underscoring the need for robust feature extraction.
Despite a decade of research, object tracking remains challenging because real-world video introduces complexities such as shape changes, lighting variations, and occlusions. Success hinges on representations that describe the target reliably under these conditions, which has led to a variety of proposed solutions. Traditional appearance-based tracking employs handcrafted feature methods, but these often fail to adapt to environmental changes.
Comparing tracking approaches
Correlation filter-based tracking builds an appearance model from handcrafted filters and relies on the fast Fourier transform (FFT) for computational efficiency. Techniques proposed to improve tracking accuracy include context learning and kernelized correlation filters. A well-trained filter produces a strong correlation peak over the target region and low responses over the background. Spatial regularization and multichannel features have advanced discriminative correlation filter (DCF) tracking, while improved kernelized correlation filters deliver strong overall performance.
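The article does not reproduce the filter formulation, but the core mechanism of this family of trackers, computing a correlation response in the frequency domain, can be sketched in a few lines of NumPy. The patch, filter, and sizes below are illustrative placeholders, not values from the paper.

```python
import numpy as np

def correlation_response(patch, filt):
    """Correlate an image patch with a filter in the frequency domain.

    Element-wise multiplication in the Fourier domain is equivalent to
    circular cross-correlation in the spatial domain, which is what makes
    correlation-filter trackers fast.
    """
    X = np.fft.fft2(patch)
    H = np.fft.fft2(filt, s=patch.shape)          # zero-pad filter to patch size
    # Conjugating the filter spectrum gives correlation rather than convolution.
    return np.real(np.fft.ifft2(X * np.conj(H)))

# Toy example: a 64x64 search patch and an 8x8 filter cut from it.
rng = np.random.default_rng(0)
patch = rng.standard_normal((64, 64))
filt = patch[20:28, 30:38]                        # pretend the filter matches this region
resp = correlation_response(patch, filt)
peak = np.unravel_index(np.argmax(resp), resp.shape)
print("correlation peak at", peak)                # a strong peak marks the target location
```

Because the multiplication in the Fourier domain replaces an explicit sliding-window correlation, the response over the entire search patch is obtained at FFT cost.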
CNN-based tracking algorithms, by contrast, harness deep features from convolutional layers, which have proven effective across computer vision tasks. Whereas traditional methods rely on handcrafted features such as color histograms, recent work explores deep learning. Proposed approaches include multilayer auto-encoder networks, two-layer CNN classifiers, and neural networks that generate target-specific saliency maps, all aimed at improving tracking accuracy and robustness.
Revolutionizing visual object tracking: A novel architecture
The tracking algorithm presented in the study takes images of the target object and the search region as input to a deep, fully convolutional CNN. The network, configured as a Siamese network with a Y-shaped branch structure, extracts essential features from both regions. These features are then passed through an FFT layer within the Region Proposal Network (RPN), which classifies objects and predicts the bounding box center coordinates.
Features obtained from a CNN are central to computer vision and, in particular, to robust VOT. A standard CNN extracts features through its convolutional layers and passes the result to a fully connected layer. That fully connected layer, however, discards spatial location information, which is crucial for VOT.
To address this issue, the current study developed a custom network devoid of fully connected layers. This network relies on deeply stacked convolutional layers to preserve spatial and semantic information. The convolutional block compresses and expands the input feature map, allowing for the extraction of high-level features.
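The article does not give the exact layer configuration, so the following PyTorch sketch only illustrates the general pattern of a fully convolutional block that compresses and then expands its feature map without any fully connected layer; the channel counts, squeeze ratio, and residual connection are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Bottleneck-style block: compress channels, transform, then expand.

    Illustrative only; the paper's actual channel counts and depth are not
    specified in the article.
    """
    def __init__(self, channels, squeeze=4):
        super().__init__()
        mid = channels // squeeze
        self.block = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1),        # compress
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3, padding=1),  # transform
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1),        # expand
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # A residual connection keeps both spatial and semantic information flowing.
        return self.relu(self.block(x) + x)

features = torch.randn(1, 64, 31, 31)   # a feature map from earlier conv layers
print(ConvBlock(64)(features).shape)    # spatial size is preserved: (1, 64, 31, 31)
```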
Siamese networks excel at comparing the similarity between a pair of inputs. The two branches share all parameters, have identical Y-shaped structures, and extract features from their inputs using the same network. A distance function then measures the similarity between the extracted features, which distinguishes Siamese networks from traditional neural networks designed for multi-class prediction.
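A minimal sketch of this Siamese pattern (not the paper's network) is shown below: the target and search images pass through one shared backbone, and a distance function compares the resulting embeddings. The backbone layers and pooling are invented for illustration, while the 127 by 127 and 255 by 255 input sizes follow the training setup described later in the article.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEmbedder(nn.Module):
    """Two inputs, one shared backbone: both branches use the same weights."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # pool to a single feature vector
        )

    def embed(self, x):
        return self.backbone(x).flatten(1)

    def forward(self, a, b):
        # A small distance means the two inputs are similar.
        return F.pairwise_distance(self.embed(a), self.embed(b))

net = SiameseEmbedder()
target = torch.randn(1, 3, 127, 127)      # exemplar (target) image
search = torch.randn(1, 3, 255, 255)      # search-region image
print(net(target, search))                # one distance value per image pair
```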
The Region Proposal Network (RPN) predicts bounding box coordinates through an FFT-based convolution carried out in the frequency domain, which lowers the computational cost. Anchor boxes, each labeled to indicate the presence or absence of the object, are central to this prediction: the RPN performs a binary classification over the anchor boxes, producing the probabilities that each box contains the object or only background. Carrying this step out with the FFT keeps the process both robust and computationally efficient.
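The article does not detail the FFT layer itself. One common way to realize FFT-based cross-correlation between template features and search-region features, followed by per-anchor classification and regression heads, is sketched below; the channel counts, feature-map sizes, and the choice of five anchors are assumptions for illustration, not values from the paper.

```python
import torch
import torch.fft
import torch.nn as nn

def fft_cross_correlation(search_feat, template_feat):
    """Channel-wise cross-correlation computed in the frequency domain.

    Multiplying the search spectrum by the conjugate of the template spectrum
    and inverting the transform is the FFT equivalent of sliding the template
    over the search features (with circular boundaries).
    """
    H, W = search_feat.shape[-2:]
    S = torch.fft.rfft2(search_feat, s=(H, W))
    T = torch.fft.rfft2(template_feat, s=(H, W))
    return torch.fft.irfft2(S * torch.conj(T), s=(H, W))

num_anchors = 5                              # k anchor shapes per spatial location
search_feat = torch.randn(1, 256, 31, 31)    # features of the search region
template_feat = torch.randn(1, 256, 7, 7)    # features of the target exemplar

response = fft_cross_correlation(search_feat, template_feat)

# Small conv heads turn the response map into per-anchor outputs:
# 2 scores per anchor (object / background) and 4 box offsets per anchor.
cls_head = nn.Conv2d(256, 2 * num_anchors, kernel_size=1)
reg_head = nn.Conv2d(256, 4 * num_anchors, kernel_size=1)
print(cls_head(response).shape)   # (1, 10, 31, 31) anchor classification logits
print(reg_head(response).shape)   # (1, 20, 31, 31) anchor box regressions
```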
Evaluation and benchmark results of the proposed algorithm
The ImageNet Large Scale Visual Recognition Challenge video (ILSVRC 2015 VID) dataset, designed for object detection and consisting of video sequences, was used for training the tracking network. The dataset was split into training and validation sets, comprising 3862 and 555 video snippets, respectively.
The Object-Tracking Benchmark (OTB) dataset, including the OTB-100 and OTB-50 subsets, was used for quantitative evaluation.
Network training used pairs of images drawn from the ILSVRC 2015 dataset, ignoring sequence order. Preprocessing resized the images while keeping the object's center point, yielding inputs of 127 by 127 pixels for the target image and 255 by 255 pixels for the search region. Training used anchor box coordinate labels and object classification labels, with a loss that combined an absolute-error term for anchor box coordinate estimation and a cross-entropy term for object classification.
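A minimal sketch of such a combined loss is shown below, assuming a plain L1 (absolute error) term on the box coordinates and a cross-entropy term on the object/background labels; the relative weighting of the two terms is an assumption, since the article does not state how they are balanced.

```python
import torch
import torch.nn.functional as F

def tracking_loss(cls_logits, cls_labels, box_preds, box_targets, box_weight=1.0):
    """Combined training loss in the spirit of the article's description.

    - Cross-entropy on the anchor classification (object vs. background).
    - Absolute (L1) error on the anchor box coordinate regression.
    The box_weight balancing factor is an assumption.
    """
    cls_loss = F.cross_entropy(cls_logits, cls_labels)
    box_loss = F.l1_loss(box_preds, box_targets)
    return cls_loss + box_weight * box_loss

# Toy batch: 8 anchors, 2 classes, 4 box coordinates each.
cls_logits = torch.randn(8, 2)
cls_labels = torch.randint(0, 2, (8,))
box_preds = torch.randn(8, 4)
box_targets = torch.randn(8, 4)
print(tracking_loss(cls_logits, cls_labels, box_preds, box_targets))
```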
The proposed algorithm achieved the highest precision and success scores in the OTB-50 benchmark dataset. Individual attribute results showed strengths in attributes such as occlusion, fast motion, out-of-view, motion blur, deformation, scale variation, and out-of-plane rotation. While the algorithm exhibited some weaknesses in attributes such as in-plane rotation, illumination variation, and background clutter, its overall tracking success rate was high.
Similarly, in the OTB-100 benchmark dataset, the proposed algorithm achieved the highest precision and success scores. Attribute-wise results demonstrated robustness in several attributes. The algorithm excelled in the illumination variation attribute's success score and the low-resolution attribute's precision score.
Conclusion
In summary, the researchers introduced an object-tracking algorithm built on a Siamese network trained with the ILSVRC 2015 dataset and evaluated it through success and precision plots. The RPN, incorporating the Siamese network and FFT convolution, handled tracking and inference of the object region. While the algorithm performed well across most video attributes, limitations were observed in in-plane rotation and background clutter scenarios. Future work will focus on enhancing tracking by extracting essential information within small regions, utilizing graph-based feature selection, and analyzing contextual relationships to improve overall performance.