The field of computer vision has made notable advances in detecting text within scenes, a capability that matters in many practical contexts, including scene comprehension, text recognition, and self-driving vehicles. Because scene text varies widely in appearance, is structurally intricate, and often overlaps, detecting dense text in scene images remains difficult.
A recent study published in the journal Sensors proposes a method called "DenseTextPVT" to detect dense text in scenes efficiently. It introduces a novel approach to precise dense text prediction built on the versatile pyramid vision transformer (PvTv2) backbone. This backbone is engineered to deliver high output resolution for dense prediction tasks in object detection while minimizing resource consumption through a progressive shrinking pyramid.
The study aims to expand receptive fields while maintaining high-resolution features, a critical requirement for dense prediction tasks. To improve feature representation, the deep multi-scale feature refinement network (DMFRN) detects text of various sizes, shapes, and typefaces. In post-processing, DenseTextPVT clusters text pixels into the correct text kernels using the similarity-vector approach of pixel aggregation.
This improves text detection and eliminates overlapping text regions among densely adjacent text instances in natural images. In extensive experiments, the solution outperforms previous methods on the ICDAR-2015, CTW1500, and TotalText benchmark datasets.
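To make the pixel-aggregation step concrete, the sketch below illustrates the general idea in NumPy/SciPy: kernel connected components act as cluster seeds, and remaining text pixels are attached to the kernel whose mean similarity vector is closest, within a distance threshold. The function name, input layout, and threshold value are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy import ndimage

def aggregate_pixels(text_mask, kernel_mask, sim_vectors, dist_thresh=0.8):
    """Illustrative PAN-style pixel aggregation (not the paper's exact code).

    text_mask:   (H, W) bool, predicted text region
    kernel_mask: (H, W) bool, predicted (shrunk) text kernels
    sim_vectors: (C, H, W) float, per-pixel similarity vectors
    Returns an (H, W) label map, one label per text instance.
    """
    labels, num = ndimage.label(kernel_mask)      # kernels act as cluster seeds
    if num == 0:
        return labels
    # Mean similarity vector of each kernel, shape (num, C)
    means = np.stack([sim_vectors[:, labels == k].mean(axis=1)
                      for k in range(1, num + 1)])
    out = labels.copy()
    # Attach remaining text pixels to the nearest kernel in similarity space
    ys, xs = np.nonzero(text_mask & (labels == 0))
    for y, x in zip(ys, xs):
        d = np.linalg.norm(means - sim_vectors[:, y, x], axis=1)
        k = d.argmin()
        if d[k] < dist_thresh:
            out[y, x] = k + 1
    return out
```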
Study: DenseTextPVT: Pyramid Vision Transformer with Deep Multi-Scale Feature Refinement Network for Dense Text Detection.
What is DenseTextPVT?
Transformers have garnered significant attention in computer vision research because they simplify end-to-end anchor generation and post-processing. ViT, a vision transformer architecture, performs well on image classification tasks by applying the transformer directly to sequences of image patches.
PvTv2 proposed a flexible backbone that achieves high output resolution for many vision applications, particularly dense prediction tasks, while reducing time consumption by inheriting the benefits of both CNNs and transformers. A scene text detector needs robust representations and discriminative features to locate text regions accurately. PvTv2 handles dense prediction tasks in image applications such as image classification, object detection, and semantic segmentation, and DenseTextPVT adopts the PvTv2 architecture to improve the features used for dense scene text detection.
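The following minimal PyTorch sketch illustrates the progressive-shrinking-pyramid idea: each stage downsamples the feature map and widens the channels, so the backbone emits feature maps at strides 4, 8, 16, and 32 for dense prediction heads. Plain convolutions stand in for PvTv2's patch embeddings and attention blocks; the class name and channel widths are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class TinyPyramidBackbone(nn.Module):
    """Minimal sketch of a progressive shrinking pyramid: each stage halves
    (or quarters) the spatial resolution and widens the channels, yielding
    multi-scale feature maps for dense prediction heads. Convolutions stand
    in for PvTv2's patch embedding + attention blocks."""
    def __init__(self, dims=(64, 128, 320, 512)):
        super().__init__()
        in_ch, stages = 3, []
        for i, out_ch in enumerate(dims):
            stride = 4 if i == 0 else 2          # stage 1 shrinks by 4, later stages by 2
            stages.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=stride + 1,
                          stride=stride, padding=stride // 2),
                nn.BatchNorm2d(out_ch),
                nn.GELU(),
            ))
            in_ch = out_ch
        self.stages = nn.ModuleList(stages)

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)                      # keep every resolution for the pyramid
        return feats

feats = TinyPyramidBackbone()(torch.randn(1, 3, 640, 640))
print([f.shape for f in feats])  # strides 4, 8, 16, 32 -> 160, 80, 40, 20
```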
What does this study involve?
Some earlier works in this direction handle text lines and variations in line length well. However, they still struggle with extensive text overlap, especially for small fonts. Researchers have tried to enlarge text regions from their kernels to overcome the overlap problem, yet these attempts did not achieve competitive scene text detection performance.
To address these obstacles, the present methodology employs a multi-scale approach that combines three distinct kernel filters with attention techniques, which the authors call the deep multi-scale feature refinement network (DMFRN). The approach builds and integrates multi-level features, which provide comprehensive representations of text instances within a scene.
Furthermore, this work draws inspiration from the transformer's ability to simplify the intricate object detection pipeline by removing hand-designed procedures and improving the modeling of spatial arrangement and contextual information. In contrast to the original backbone, this approach incorporates a channel attention module (CAM) and a spatial attention module (SAM) at each feature level to efficiently capture and exploit significant features along both the spatial and channel dimensions.
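A compact, CBAM-style sketch of the two attention modules is shown below; it conveys the mechanism (channel re-weighting followed by spatial re-weighting) rather than the paper's exact layer configuration.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CAM: squeeze the spatial dimensions, re-weight the channels (CBAM-style)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))

    def forward(self, x):
        avg = self.mlp(x.mean(dim=(2, 3)))       # global average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))        # global max pooling
        w = torch.sigmoid(avg + mx)[:, :, None, None]
        return x * w

class SpatialAttention(nn.Module):
    """SAM: squeeze the channels, re-weight the spatial positions (CBAM-style)."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w

# Applying both modules preserves the feature map shape: (2, 64, 40, 40)
y = SpatialAttention()(ChannelAttention(64)(torch.randn(2, 64, 40, 40)))
```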
Major findings
The principal contributions of this study are outlined as follows:
- The authors present a novel object detection methodology that builds on the strengths of a dense prediction backbone. The PvTv2 backbone is extended with two modules, namely channel attention and spatial attention. This enables the extraction of high-resolution features for predicting dense text in natural scene images.
- At each feature level, the model applies a deep multi-scale feature refinement network (DMFRN) with three kernel filters (3 × 3, 5 × 5, 7 × 7) and CBAM, enriching the feature representations of text of diverse sizes, including small text (a simplified sketch follows this list).
- The DenseTextPVT model attains notably high Precision (P) and F-measure (F) values of 89.4% and 84.7%, respectively, without relying on any external dataset.
- It also shows robust results on the CTW1500 benchmark, which contains curved text, achieving P and F scores of 88.3% and 83.9%, respectively. Several algorithms, including TextRay, DB, PAN, and CRAFT, report marginally higher Recall (R) scores, but this approach surpasses them in overall performance.
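As referenced in the list above, a simplified sketch of the multi-scale refinement idea is given below: parallel 3 × 3, 5 × 5, and 7 × 7 convolution branches are fused and then re-weighted with the CAM/SAM modules from the earlier sketch. The layer sizes and fusion by summation are illustrative assumptions, not the DMFRN's exact design.

```python
import torch
import torch.nn as nn

class MultiScaleRefine(nn.Module):
    """Sketch of the multi-scale refinement idea: parallel 3x3 / 5x5 / 7x7
    branches capture text of different sizes; their outputs are summed and
    re-weighted by channel + spatial attention (CBAM). Reuses the
    ChannelAttention / SpatialAttention classes from the earlier sketch."""
    def __init__(self, channels):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, k, padding=k // 2) for k in (3, 5, 7)
        ])
        self.cam = ChannelAttention(channels)
        self.sam = SpatialAttention()

    def forward(self, x):
        fused = sum(branch(x) for branch in self.branches)   # multi-scale fusion
        return self.sam(self.cam(fused))                     # CBAM re-weighting

refined = MultiScaleRefine(64)(torch.randn(2, 64, 40, 40))   # shape preserved
```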
Conclusion
The current investigation presents a novel approach, referred to as DenseTextPVT, designed to detect closely spaced text within a scene. The approach adapts the PvTv2 backbone by integrating channel and spatial attention modules for dense prediction, and uses a deep multi-scale feature refinement network to acquire multi-level feature information effectively.
Subsequently, the post-processing methodology of PAN is adopted to mitigate overlap between text regions. The results show better performance than state-of-the-art techniques on several widely used benchmark datasets.