Evolution and Advancements in Computer Vision-led Object Detection

By Dr. Sampath Lonka. Reviewed by Susha Cheriyedath, M.Sc.

Object detection is a fundamental and intricate task in computer vision. It centers on locating and classifying objects within images and videos, and it has commanded significant research attention over the years. This attention has catalyzed advances in related areas of computer vision, including object classification, object counting, and object monitoring. Many methods and frameworks for object detection have been developed over the past 20 years.

The Evolution of Object Detection

Before the era of deep learning, object detection was divided into three key steps: proposal generation, feature vector extraction, and region classification. Proposal generation involved searching for regions likely to contain objects, often by sliding windows across the image.
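The sliding-window search can be sketched in a few lines of Python. This is a minimal illustration rather than any particular detector's implementation; the function names, window sizes, and the half-window stride are assumptions chosen for the example.

```python
from typing import Iterator, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


def sliding_windows(img_w: int, img_h: int, win: int, stride: int) -> Iterator[Box]:
    """Yield square windows of side `win` that fit fully inside the image."""
    for y in range(0, img_h - win + 1, stride):
        for x in range(0, img_w - win + 1, stride):
            yield (x, y, x + win, y + win)


def multi_scale_windows(img_w: int, img_h: int,
                        sizes: Tuple[int, ...] = (64, 128),  # hand-picked scales (assumed)
                        stride_frac: float = 0.5) -> List[Box]:
    """Collect candidate regions at several manually chosen window sizes."""
    boxes: List[Box] = []
    for win in sizes:
        stride = max(1, int(win * stride_frac))
        boxes.extend(sliding_windows(img_w, img_h, win, stride))
    return boxes


proposals = multi_scale_windows(256, 256)
print(len(proposals))  # number of candidate regions for a 256x256 image
```

Even this small image yields dozens of heavily overlapping candidates, which hints at why exhaustive window search produces many redundant proposals.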

Feature vectors captured discriminative semantic information at each location. These vectors were typically encoded with low-level visual descriptors, such as scale-invariant feature transform (SIFT), Haar, histogram of oriented gradients (HOG), or speeded-up robust features (SURF). In the final step, region classifiers were employed to assign categorical labels to these regions.
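To give a feel for what a hand-crafted descriptor looks like, the sketch below computes a single gradient-orientation histogram for an image patch with NumPy. It is a simplified, HOG-style illustration only: the full HOG descriptor also uses cells, overlapping blocks, and interpolated voting, and the function name and the choice of 9 bins are assumptions made for this example.

```python
import numpy as np


def orientation_histogram(patch: np.ndarray, n_bins: int = 9) -> np.ndarray:
    """Histogram of unsigned gradient orientations, weighted by gradient magnitude."""
    patch = patch.astype(np.float64)
    gy, gx = np.gradient(patch)            # derivatives along rows (y) and columns (x)
    magnitude = np.hypot(gx, gy)
    # Unsigned orientation in degrees, folded into [0, 180)
    angle = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    bins = np.minimum((angle / (180.0 / n_bins)).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=magnitude.ravel(), minlength=n_bins)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist  # L2-normalize for contrast robustness


# A horizontal intensity ramp has gradients pointing along x only,
# so all of the histogram mass falls into the first orientation bin.
ramp = np.tile(np.arange(8.0), (8, 1))
print(orientation_histogram(ramp))
```

A region classifier would then take vectors like this one as input and assign a category label to the region.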

While these traditional methods achieved impressive results, they had limitations, including the generation of redundant proposals, the manual design of window scales, and the reliance on hand-crafted low-level visual cues. The pipeline steps were optimized separately, hindering a globally optimal solution for the entire system. The advent of deep convolutional neural networks (DCNN) marked a significant shift in object detection.

These networks offered hierarchical feature representations, automatically learned from training data, enabling more discriminative expression in complex contexts. Deep learning's power became evident with large-scale datasets such as ImageNet, which facilitated training deep models. DCNN provided more powerful feature representations, which could be optimized end-to-end.

Deep Learning Models for Object Detection

State-of-the-art object detectors using deep learning fall into two main categories: two-stage detectors and one-stage detectors. Two-stage detectors are known for their high accuracy but have slower inference speeds. In contrast, one-stage detectors are faster but may have slightly lower performance.

Two-stage Detectors: Two-stage detectors split the detection task into two stages: proposal generation and prediction for those proposals. In the first stage, the detector identifies regions of the image that may contain objects. In the second stage, a deep learning model classifies these proposals and refines their localization. The pioneering two-stage object detector, region-based CNN (R-CNN), greatly enhanced detection performance by producing a sparse set of proposals and classifying each with a deep convolutional neural network. However, it ran the network separately on every proposal, which led to redundant computation and lengthy training and testing. SPP-net reduced this cost by computing the convolutional feature map once per image and pooling region features from it, but its multi-stage training remained a drawback. By employing region of interest (ROI) pooling to extract region features, Fast R-CNN overcame the drawbacks of SPP-net and allowed for end-to-end optimization of the detection network. Faster R-CNN then introduced the Region Proposal Network, which learns proposal generation in a data-driven manner and achieved state-of-the-art results at the time.
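The idea behind ROI pooling, turning a region of arbitrary size into a fixed-size feature grid, can be sketched with NumPy as follows. This is a simplified single-channel illustration under assumed conventions (integer ROI coordinates given directly in feature-map cells, a hypothetical function name), not the Fast R-CNN implementation itself.

```python
import numpy as np


def roi_max_pool(feature_map: np.ndarray, roi, out_size: int = 2) -> np.ndarray:
    """Max-pool the region roi=(x0, y0, x1, y1) of a 2-D feature map
    into a fixed out_size x out_size grid."""
    x0, y0, x1, y1 = roi
    region = feature_map[y0:y1, x0:x1]
    h, w = region.shape
    # Bin edges that split the region into out_size roughly equal parts
    ys = np.linspace(0, h, out_size + 1).astype(int)
    xs = np.linspace(0, w, out_size + 1).astype(int)
    out = np.empty((out_size, out_size), dtype=feature_map.dtype)
    for i in range(out_size):
        for j in range(out_size):
            # Guarantee every bin covers at least one cell
            cell = region[ys[i]:max(ys[i + 1], ys[i] + 1),
                          xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[i, j] = cell.max()
    return out


fmap = np.arange(16).reshape(4, 4)
print(roi_max_pool(fmap, (0, 0, 4, 4)))  # 4x4 region -> 2x2 output
print(roi_max_pool(fmap, (1, 1, 4, 4)))  # 3x3 region -> still 2x2 output
```

Because every proposal is reduced to the same output shape, all regions can share one feature map and pass through the same classification layers.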

One-stage Detectors: One-stage detectors do not have a separate proposal generation stage and directly classify all regions as potential objects or backgrounds. OverFeat performed object detection by casting a deep convolutional neural network classifier into a fully convolutional object detector, considering object detection as a "multi-region classification" problem.

The You Only Look Once (YOLO) model divided the image into grid cells and treated object detection as a regression problem, predicting object presence, bounding box coordinates, and class for each cell. SSD (Single-Shot MultiBox Detector) improved upon YOLO by using a set of anchors with multiple scales and aspect ratios to discretize the output space of bounding boxes.
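The grid-cell formulation can be made concrete with a small sketch that maps a ground-truth box to the cell responsible for it and to regression targets. The encoding below (center offsets relative to the cell, width and height relative to the image) follows the general idea of the original YOLO; the function name, the 7x7 grid, and the 448x448 image size are assumptions for illustration.

```python
def yolo_cell_target(box, img_w: int, img_h: int, grid: int = 7):
    """Map a box (x0, y0, x1, y1) in pixels to its grid cell and regression targets."""
    x0, y0, x1, y1 = box
    # Box center, normalized to [0, 1]
    cx = (x0 + x1) / 2 / img_w
    cy = (y0 + y1) / 2 / img_h
    # The cell containing the center is responsible for predicting the object
    col = min(int(cx * grid), grid - 1)
    row = min(int(cy * grid), grid - 1)
    # Center offset within the cell, plus size relative to the whole image
    tx, ty = cx * grid - col, cy * grid - row
    tw, th = (x1 - x0) / img_w, (y1 - y0) / img_h
    return row, col, (tx, ty, tw, th)


row, col, target = yolo_cell_target((100, 100, 200, 300), img_w=448, img_h=448)
print(row, col, target)
```

Each cell's output vector then holds these box values together with an objectness score and class probabilities, so the whole detection task becomes one regression over the grid.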

RetinaNet addressed the class imbalance problem in one-stage detectors by using focal loss, which suppressed gradients of easy negative samples. YOLOv2 improved detection performance and maintained real-time inference speed using a more powerful deep convolutional backbone architecture. YOLOv3 incorporated incremental improvements with a larger network, data augmentation, and other techniques. However, it did not bring groundbreaking changes.
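How focal loss down-weights easy examples can be seen in a short NumPy sketch of its binary form, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). The defaults gamma = 2 and alpha = 0.25 are the values commonly quoted from the RetinaNet paper; the function name and the clipping constant are assumptions for this example.

```python
import numpy as np


def focal_loss(p, y, gamma: float = 2.0, alpha: float = 0.25, eps: float = 1e-7):
    """Per-example binary focal loss.

    p: predicted probability of the positive (object) class
    y: ground-truth label, 1 for object and 0 for background
    """
    p = np.clip(np.asarray(p, dtype=np.float64), eps, 1 - eps)
    y = np.asarray(y)
    p_t = np.where(y == 1, p, 1 - p)            # probability given to the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    # (1 - p_t)^gamma shrinks toward 0 for well-classified examples
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)


easy_negative = focal_loss(0.01, 0)  # background, confidently predicted as background
hard_negative = focal_loss(0.90, 0)  # background, wrongly predicted as object
print(easy_negative, hard_negative)
```

With gamma set to 0 the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy, so gamma controls how strongly the many easy background locations are suppressed.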

CenterNet models objects as points rather than bounding boxes, using a heatmap to locate object centers. It is accurate across various tasks, but it requires specific backbone architectures. YOLOv4 introduced various techniques to improve training and inference. It is fast, easy to train, and suitable for production systems. Inspired by the success of the transformer in natural language processing, Swin Transformer provides a transformer-based backbone for computer vision applications. It splits images into patches and applies Swin Transformer blocks, which compute self-attention within shifted local windows, and it has produced state-of-the-art results on datasets such as Microsoft Common Objects in Context (COCO).
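Reading object centers off a heatmap can be sketched as a local-maximum search: a location counts as a center if it scores above a threshold and is the largest value in its 3x3 neighborhood. The NumPy code below is a minimal single-class illustration; the function name and the 0.5 threshold are assumptions, and a real detector would additionally regress box size and offset at each peak.

```python
import numpy as np


def heatmap_peaks(heatmap, threshold: float = 0.5):
    """Return (row, col, score) for every local maximum above the threshold."""
    heatmap = np.asarray(heatmap, dtype=np.float64)
    h, w = heatmap.shape
    # Pad with -inf so border cells are compared only with real neighbors
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    # Stack the nine shifted views that make up each 3x3 neighborhood
    neighborhood = np.stack([padded[dy:dy + h, dx:dx + w]
                             for dy in range(3) for dx in range(3)])
    is_peak = (heatmap >= neighborhood.max(axis=0)) & (heatmap >= threshold)
    ys, xs = np.nonzero(is_peak)
    return [(int(y), int(x), float(heatmap[y, x])) for y, x in zip(ys, xs)]


hm = np.zeros((5, 5))
hm[1, 1] = 0.9   # a strong center
hm[1, 2] = 0.6   # a weaker neighbor of the same object, suppressed by the peak
hm[3, 4] = 0.7   # a second object near the border
print(heatmap_peaks(hm))
```

The neighborhood comparison plays the role that non-maximum suppression plays in box-based detectors, which is part of what keeps point-based detection simple.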

These models have advanced single-stage object detection with various innovations, addressing speed, accuracy, and efficiency. The model selected is determined by application-specific requirements and trade-offs.

Applications of Object Detection

Object detection finds extensive applications across various industries and domains, including but not limited to security, image retrieval, surveillance, machine inspection, and autonomous vehicle systems.

Pedestrian Detection: Pedestrian detection holds substantial significance in various applications, encompassing fields such as video surveillance, autonomous driving, and robotics. In intelligent video surveillance, pedestrian detection enhances semantic understanding and bolsters safety by monitoring pedestrian movements.

Object detection is a critical component of autonomous driving systems, where it identifies nearby objects such as cars, pedestrians, road signs, and traffic signals. This information helps the car make decisions about braking and turning, and such vehicles could offer greater independence to visually impaired individuals.

Researchers have created diverse datasets to address challenges in pedestrian detection, including small object detection, dense and occluded pedestrians, real-time detection, and varying weather conditions. Many studies have contributed to the improvement of pedestrian detection techniques.

Notably, a social distancing monitoring network has been developed, employing architectures such as PeleeNet and YOLOv3. Various publications have focused on datasets such as CityPersons, Caltech, and COCO, utilizing Faster R-CNN and different YOLO versions in the context of deep learning advancements.

Face Detection and Recognition: Face detection locates human faces within digital images, whereas face recognition, frequently used in biometrics, verifies the identity of individuals from photos or videos. This technology is integrated into phone unlocking and has broad applications, enhancing security in banks, airports, retail stores, and biometric surveillance.

Face recognition plays a role in smart home security systems, where recognized faces activate sensors, and facial recognition-based security systems have been proposed. Research explores advanced face detection techniques, including face detection in collaborative learning environments, complexion-based detection, and deep learning methods. Face detection extends to recognizing facial expressions, verifying liveness, and countering spoofing attacks through heart-rate measurements. Novel methods like CattleFaceNet aim to enhance face recognition accuracy and adaptability under diverse conditions.

These developments are pivotal for visual impairment assistive devices, cattle identification, and remote work during the COVID-19 pandemic. Recent publications continue to advance the field, offering innovative solutions for various datasets.

Other Applications: In recent years, object detection methods have significantly contributed to areas such as healthcare and autonomous vehicles. They can also power image search, where objects within images are detected and labeled, and the labels are then used to retrieve images via URLs.

Object counting, whether estimating crowd sizes during festivals or quantifying objects in real-time videos, is another utility. Automatic image annotation involves automatically assigning metadata, such as captions or keywords, to digital images, streamlining organization and retrieval.
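Once a detector has produced labeled, scored detections, object counting reduces to a small aggregation step. The sketch below assumes a hypothetical output format, a list of (label, score) pairs for one frame, and an illustrative confidence threshold of 0.5; real detectors return richer structures that would first need to be mapped to this form.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_objects(detections: Iterable[Tuple[str, float]],
                  min_score: float = 0.5) -> Counter:
    """Count detections per class, ignoring low-confidence predictions."""
    return Counter(label for label, score in detections if score >= min_score)


# Hypothetical detections for a single video frame
frame_detections = [
    ("person", 0.92), ("person", 0.81), ("person", 0.34),
    ("bicycle", 0.77), ("dog", 0.12),
]
counts = count_objects(frame_detections)
print(counts["person"], counts["bicycle"], counts["dog"])
```

The threshold trades missed objects against false counts, so in dense crowd scenes it usually needs tuning for the detector and footage at hand.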

Object extraction, closely related to image segmentation, enhances the meaningful representation of objects. Image segmentation divides the image into subparts based on color or intensity, while object extraction refines these segmented parts by allowing users to designate background and foreground regions. This technology is already used for changing image backgrounds and holds the potential for extracting objects from videos with further development.

Key Challenges in Object Detection

In the past decade, computer vision has made significant strides, yet it grapples with noteworthy challenges. These challenges, pertinent to real-world applications, encompass several key aspects:

Intra-Class Variation: Variability within instances of the same object is a frequent occurrence in the natural world. Factors such as occlusion, illumination, pose, viewpoint, and more contribute to this variation, significantly impacting object appearance. This can involve non-rigid deformations, rotations, scaling, blurriness, or inconspicuous surroundings, rendering object extraction complex.

Number of Categories: The sheer multitude of object classes available for classification poses a formidable problem. It necessitates extensive, high-quality annotated data, which can be scarce. An ongoing research question revolves around the effectiveness of training detectors with a limited number of examples.

Efficiency: Contemporary models demand substantial computational resources to yield precise detection outcomes. As mobile and edge devices become increasingly prevalent, the development of efficient object detectors is paramount for the progression of computer vision.

References and Further Readings

Kaur, J., and Singh, W. (2022). Tools, techniques, datasets, and application areas for object detection in an image: a review, Multimedia Tools and Applications, 81(27), 38297-38351. https://doi.org/10.1007/s11042-022-13153-y

Zou, Z., Chen, K., Shi, Z., Guo, Y., and Ye, J. (2023). Object detection in 20 years: A survey. Proceedings of the IEEE. https://doi.org/10.1109/JPROC.2023.3238524

Zaidi, S. S. A., Ansari, M. S., Aslam, A., Kanwal, N., Asghar, M., and Lee, B. (2022). A survey of modern deep learning-based object detection models. Digital Signal Processing, 126, 103514. https://doi.org/10.1016/j.dsp.2022.103514

Last Updated: Oct 16, 2023
