In a recent publication in the journal Electronics, researchers introduced a Siamese network built from a convolutional neural network (CNN) that improves the precision and robustness of visual object tracking.
Background
Visual object tracking (VOT) is pivotal in computer vision, with applications in factory automation, autonomous driving, surveillance, and drones. The task is to track a target across video frames, typically initialized with a bounding box around the target in the first frame. However, the information in that initial frame often proves insufficient for consistent tracking, underscoring the need for robust feature extraction.
Despite a decade of research, object tracking remains challenging because real-world video introduces complexities such as shape changes, lighting variations, and occlusions. Success hinges on representations that describe the target reliably under these conditions, which has led to a variety of proposed solutions. Traditional appearance-based tracking employs handcrafted feature methods, but these often fail to adapt to environmental changes.
Comparing tracking approaches
Correlation filter-based tracking builds an appearance model from handcrafted filters and relies on the fast Fourier transform (FFT) for computational efficiency. Techniques proposed to improve tracking accuracy include context learning and kernelized correlation filters. A well-trained filter produces a strong correlation peak over the target region and low responses over the background. Spatial regularization and multichannel features have advanced discriminative correlation filter (DCF) tracking, while improved kernelized correlation filters deliver strong overall performance.
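The article does not reproduce the filter formulation, but the core mechanism of this family of trackers, computing a correlation response in the frequency domain, can be sketched in a few lines of NumPy. The patch, filter, and sizes below are illustrative placeholders, not values from the paper.

```python
import numpy as np

def correlation_response(patch, filt):
    """Correlate an image patch with a filter in the frequency domain.

    Element-wise multiplication in the Fourier domain is equivalent to
    circular cross-correlation in the spatial domain, which is what makes
    correlation-filter trackers fast.
    """
    X = np.fft.fft2(patch)
    H = np.fft.fft2(filt, s=patch.shape)          # zero-pad filter to patch size
    # Conjugating the filter spectrum gives correlation rather than convolution.
    return np.real(np.fft.ifft2(X * np.conj(H)))

# Toy example: a 64x64 search patch and an 8x8 filter cut from it.
rng = np.random.default_rng(0)
patch = rng.standard_normal((64, 64))
filt = patch[20:28, 30:38]                        # pretend the filter matches this region
resp = correlation_response(patch, filt)
peak = np.unravel_index(np.argmax(resp), resp.shape)
print("correlation peak at", peak)                # a strong peak marks the target location
```

Because the multiplication in the Fourier domain replaces an explicit sliding-window correlation, the response over the entire search patch is obtained at FFT cost.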
CNN-based tracking algorithms, by contrast, harness deep features from convolutional layers, which have proven effective across computer vision tasks. Whereas traditional methods rely on handcrafted features such as color histograms, recent work explores deep learning. Proposed approaches include multilayer auto-encoder networks, two-layer CNN classifiers, and neural networks that generate target-specific saliency maps, all aimed at improving tracking accuracy and robustness.
Revolutionizing visual object tracking: A novel architecture
The tracking algorithm presented in the study takes images of the target object and the search region as input to a deep, fully convolutional CNN. The network, configured as a Siamese network with a Y-shaped branch structure, extracts essential features from both regions. These features are then passed through an FFT layer within the Region Proposal Network (RPN), which classifies objects and predicts the bounding box center coordinates.
Features obtained from a CNN are central to computer vision and, in particular, to robust VOT. A standard CNN extracts features through its convolutional layers and passes the result to a fully connected layer. That fully connected layer, however, discards spatial location information, which is crucial for VOT.
To address this issue, the current study developed a custom network devoid of fully connected layers. This network relies on deeply stacked convolutional layers to preserve spatial and semantic information. The convolutional block compresses and expands the input feature map, allowing for the extraction of high-level features.
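The article does not give the exact layer configuration, so the following PyTorch sketch only illustrates the general pattern of a fully convolutional block that compresses and then expands its feature map without any fully connected layer; the channel counts, squeeze ratio, and residual connection are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Bottleneck-style block: compress channels, transform, then expand.

    Illustrative only; the paper's actual channel counts and depth are not
    specified in the article.
    """
    def __init__(self, channels, squeeze=4):
        super().__init__()
        mid = channels // squeeze
        self.block = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1),        # compress
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3, padding=1),  # transform
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1),        # expand
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # A residual connection keeps both spatial and semantic information flowing.
        return self.relu(self.block(x) + x)

features = torch.randn(1, 64, 31, 31)   # a feature map from earlier conv layers
print(ConvBlock(64)(features).shape)    # spatial size is preserved: (1, 64, 31, 31)
```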
Siamese networks excel at comparing the similarity between a pair of inputs. The two branches share all parameters, have identical Y-shaped structures, and extract features from their inputs using the same network. A distance function then measures the similarity between the extracted features, which distinguishes Siamese networks from traditional neural networks designed for multi-class prediction.
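A minimal sketch of this Siamese pattern (not the paper's network) is shown below: the target and search images pass through one shared backbone, and a distance function compares the resulting embeddings. The backbone layers and pooling are invented for illustration, while the 127 by 127 and 255 by 255 input sizes follow the training setup described later in the article.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEmbedder(nn.Module):
    """Two inputs, one shared backbone: both branches use the same weights."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # pool to a single feature vector
        )

    def embed(self, x):
        return self.backbone(x).flatten(1)

    def forward(self, a, b):
        # A small distance means the two inputs are similar.
        return F.pairwise_distance(self.embed(a), self.embed(b))

net = SiameseEmbedder()
target = torch.randn(1, 3, 127, 127)      # exemplar (target) image
search = torch.randn(1, 3, 255, 255)      # search-region image
print(net(target, search))                # one distance value per image pair
```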
The Region Proposal Network (RPN) predicts bounding box coordinates through an FFT-based convolution carried out in the frequency domain, which lowers the computational cost. Anchor boxes, each labeled to indicate the presence or absence of the object, are central to this prediction: the RPN performs a binary classification over the anchor boxes, producing the probabilities that each box contains the object or only background. Carrying this step out with the FFT keeps the process both robust and computationally efficient.
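The article does not detail the FFT layer itself. One common way to realize FFT-based cross-correlation between template features and search-region features, followed by per-anchor classification and regression heads, is sketched below; the channel counts, feature-map sizes, and the choice of five anchors are assumptions for illustration, not values from the paper.

```python
import torch
import torch.fft
import torch.nn as nn

def fft_cross_correlation(search_feat, template_feat):
    """Channel-wise cross-correlation computed in the frequency domain.

    Multiplying the search spectrum by the conjugate of the template spectrum
    and inverting the transform is the FFT equivalent of sliding the template
    over the search features (with circular boundaries).
    """
    H, W = search_feat.shape[-2:]
    S = torch.fft.rfft2(search_feat, s=(H, W))
    T = torch.fft.rfft2(template_feat, s=(H, W))
    return torch.fft.irfft2(S * torch.conj(T), s=(H, W))

num_anchors = 5                              # k anchor shapes per spatial location
search_feat = torch.randn(1, 256, 31, 31)    # features of the search region
template_feat = torch.randn(1, 256, 7, 7)    # features of the target exemplar

response = fft_cross_correlation(search_feat, template_feat)

# Small conv heads turn the response map into per-anchor outputs:
# 2 scores per anchor (object / background) and 4 box offsets per anchor.
cls_head = nn.Conv2d(256, 2 * num_anchors, kernel_size=1)
reg_head = nn.Conv2d(256, 4 * num_anchors, kernel_size=1)
print(cls_head(response).shape)   # (1, 10, 31, 31) anchor classification logits
print(reg_head(response).shape)   # (1, 20, 31, 31) anchor box regressions
```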
Evaluation and benchmark results of the proposed algorithm
The ImageNet Large Scale Visual Recognition Challenge video (ILSVRC 2015 VID) dataset, designed for object detection and consisting of video sequences, was used for training the tracking network. The dataset was split into training and validation sets, comprising 3862 and 555 video snippets, respectively.
The Object-Tracking Benchmark (OTB) dataset, including the OTB-100 and OTB-50 subsets, was used for quantitative evaluation.
Network training used pairs of images drawn from the ILSVRC 2015 dataset, ignoring sequence order. Preprocessing resized the images while keeping the object's center point, yielding inputs of 127 by 127 pixels for the target image and 255 by 255 pixels for the search region. Training used anchor box coordinate labels and object classification labels, with a loss that combined an absolute-error term for anchor box coordinate estimation and a cross-entropy term for object classification.
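A minimal sketch of such a combined loss is shown below, assuming a plain L1 (absolute error) term on the box coordinates and a cross-entropy term on the object/background labels; the relative weighting of the two terms is an assumption, since the article does not state how they are balanced.

```python
import torch
import torch.nn.functional as F

def tracking_loss(cls_logits, cls_labels, box_preds, box_targets, box_weight=1.0):
    """Combined training loss in the spirit of the article's description.

    - Cross-entropy on the anchor classification (object vs. background).
    - Absolute (L1) error on the anchor box coordinate regression.
    The box_weight balancing factor is an assumption.
    """
    cls_loss = F.cross_entropy(cls_logits, cls_labels)
    box_loss = F.l1_loss(box_preds, box_targets)
    return cls_loss + box_weight * box_loss

# Toy batch: 8 anchors, 2 classes, 4 box coordinates each.
cls_logits = torch.randn(8, 2)
cls_labels = torch.randint(0, 2, (8,))
box_preds = torch.randn(8, 4)
box_targets = torch.randn(8, 4)
print(tracking_loss(cls_logits, cls_labels, box_preds, box_targets))
```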
The proposed algorithm achieved the highest precision and success scores in the OTB-50 benchmark dataset. Individual attribute results showed strengths in attributes such as occlusion, fast motion, out-of-view, motion blur, deformation, scale variation, and out-of-plane rotation. While the algorithm exhibited some weaknesses in attributes such as in-plane rotation, illumination variation, and background clutter, its overall tracking success rate was high.
Similarly, in the OTB-100 benchmark dataset, the proposed algorithm achieved the highest precision and success scores. Attribute-wise results demonstrated robustness in several attributes. The algorithm excelled in the illumination variation attribute's success score and the low-resolution attribute's precision score.
Conclusion
In summary, the researchers introduced an object-tracking algorithm built on a Siamese network trained with the ILSVRC 2015 dataset and evaluated it through success and precision plots. The RPN, incorporating the Siamese network and FFT convolution, handled tracking and inference of the object region. While the algorithm performed well across most video attributes, limitations were observed in in-plane rotation and background clutter scenarios. Future work will focus on enhancing tracking by extracting essential information within small regions, utilizing graph-based feature selection, and analyzing contextual relationships to improve overall performance.