In a recent publication in the journal Scientific Data, researchers introduced DEEPPATENT2, a large-scale dataset designed to address the challenge of generating accurate descriptions for sketched images in technical documents.
Background
Technical illustrations, sketches, and drawings serve as visual aids to convey information efficiently. In computer vision, the challenge lies in comprehending the intricacies of these images, encompassing object recognition, attribute determination, and contextual understanding.
While conventional datasets such as Microsoft Common Objects in Context (MS COCO) and ImageNet focus on natural images, technical drawings found in design patents offer a unique set of challenges. These drawings, though lacking the color and environmental details of natural images, provide essential abstraction, emphasizing strokes and lines that maintain human recognizability.
Despite the significance of technical drawings, they remain understudied in computer vision and information retrieval. Existing sketch datasets, such as QuickDraw and Cross-Language Evaluation Forum-Intellectual Property (CLEF-IP) 2011, fall short of capturing the rich semantic information present in technical drawings. DEEPPATENT, a prior dataset, lacked object identification and viewpoint descriptions. DEEPPATENT2 was created to fill these gaps.
Crafting the DEEPPATENT2 dataset
DEEPPATENT2 is an extensive dataset encompassing over two million technical drawings derived from design patent documents published by the United States Patent and Trademark Office (USPTO) between 2007 and 2020. It expands upon the DEEPPATENT dataset in size, content, and metadata richness.
Scale and Composition: DEEPPATENT2 surpasses DEEPPATENT, offering more than a five-fold increase in volume. It includes both original and segmented patent drawings. The metadata for each drawing incorporates object names and viewpoints, extracted with high precision by a supervised sequence-tagging model.
Data Creation Pipeline: The process involves three key components: data acquisition, text processing, and image processing. Patent documents in XML (eXtensible Markup Language) and Tag Image File Format (TIFF) formats are acquired, with each TIFF file potentially containing multiple figures, referred to as compound figures. Text processing entails extracting human-readable object names from figure captions, overcoming challenges posed by compound figures. Image processing involves figure segmentation and metadata alignment, where a novel transfer learning method, Medical Transformer (MedT), proves effective in segmenting compound figures.
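To make the three-stage flow concrete, here is a minimal Python skeleton of such a pipeline. Every function name here is a hypothetical placeholder standing in for the corresponding component described above, not the authors' code; the stubs exist only to show how acquisition, text processing, and image processing would fit together.

```python
from pathlib import Path

def extract_captions(xml_file: Path) -> list[str]:
    """Text processing: pull figure captions out of the patent XML (stub)."""
    return []

def tag_entities(captions: list[str]) -> list[dict]:
    """Sequence tagging: object name and viewpoint per caption (stub)."""
    return [{"caption": c, "object": None, "viewpoint": None} for c in captions]

def segment_compound_figure(tiff_file: Path) -> list[bytes]:
    """Image processing: split a compound TIFF into individual figures (stub)."""
    return []

def build_records(patent_dir: str):
    """Walk paired XML/TIFF patent files and align captions with segmented figures."""
    for xml_file in Path(patent_dir).glob("*.xml"):
        entities = tag_entities(extract_captions(xml_file))
        figures = segment_compound_figure(xml_file.with_suffix(".tiff"))
        yield {"source": xml_file.stem, "entities": entities, "figures": figures}
```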
Text Processing (Entity Recognition): Entity recognition involves tokenizing and encoding caption text using pre-trained models such as Distil Bidirectional Encoder Representations from Transformers (DistilBERT). The sequence-tagging model, built on a bidirectional Long Short-Term Memory network with a conditional random field (BiLSTM-CRF), recognizes object names and viewpoints with high accuracy, achieving an overall F1-measure of 0.960 for entity recognition.
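The sketch below illustrates this kind of tagger: a pre-trained DistilBERT encoder feeding a BiLSTM over a BIO tag set. The tag names and layer sizes are illustrative assumptions, and the CRF decoding layer of the paper's model is omitted for brevity; the model here is untrained and shown only to make the tokenize-encode-tag flow tangible.

```python
import torch
import torch.nn as nn
from transformers import DistilBertModel, DistilBertTokenizerFast

TAGS = ["O", "B-OBJ", "I-OBJ", "B-VIEW", "I-VIEW"]  # assumed BIO tag set

class CaptionTagger(nn.Module):
    """DistilBERT encoder + BiLSTM tagger (CRF layer omitted for brevity)."""
    def __init__(self, num_tags: int = len(TAGS), hidden: int = 256):
        super().__init__()
        self.encoder = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.bilstm = nn.LSTM(self.encoder.config.dim, hidden,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_tags)

    def forward(self, input_ids, attention_mask):
        # Contextual token embeddings from the pre-trained encoder
        states = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.bilstm(states)
        return self.classifier(lstm_out)  # per-token tag logits

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = CaptionTagger()  # untrained here; shown only to illustrate the flow
enc = tokenizer("FIG. 3 is a perspective view of a chair;", return_tensors="pt")
logits = model(enc["input_ids"], enc["attention_mask"])
predicted_tags = [TAGS[i] for i in logits.argmax(-1)[0].tolist()]
```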
Image Processing (Figure Segmentation and Metadata Alignment): Figure labels are identified using Amazon Rekognition, which surpasses alternative optical character recognition (OCR) engines in precision, recall, and F1 score. Compound figures are segmented with the MedT model, which outperforms baseline methods such as point-shooting, U-Net, HR-Net, and Detection Transformer (DETR).
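For the label-detection step, a short sketch of how Rekognition's text-detection API could be applied is shown below. The `detect_text` call is the real boto3 API (and assumes configured AWS credentials); the "FIG"-prefix filtering heuristic is an illustrative assumption, not the authors' exact procedure.

```python
import re
import boto3

rekognition = boto3.client("rekognition")  # assumes AWS credentials are configured

def detect_figure_labels(image_path: str):
    """Return (text, bounding box) pairs for words that look like figure labels."""
    with open(image_path, "rb") as f:
        response = rekognition.detect_text(Image={"Bytes": f.read()})
    labels = []
    for det in response["TextDetections"]:
        # Keep word-level detections matching a "FIG..." pattern
        if det["Type"] == "WORD" and re.match(r"FIG", det["DetectedText"], re.I):
            labels.append((det["DetectedText"], det["Geometry"]["BoundingBox"]))
    return labels
```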
Data Records: The final dataset comprises two million compound PNG figures, 2.7 million segmented PNG figures, and JSON (JavaScript Object Notation) metadata organized by year. Metadata includes patent ID, original figure file, object names, viewpoints, figure labels, bounding boxes, and document-level information.
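As a rough sketch, consuming the per-year metadata could look like the snippet below. The filename and field keys mirror the fields listed above but are assumptions about the schema, not the dataset's documented layout.

```python
import json

# Hypothetical filename; the dataset organizes JSON metadata by year.
with open("deeppatent2_2020.json", encoding="utf-8") as f:
    records = json.load(f)

for rec in records[:3]:
    # Key names follow the description above but the exact schema is assumed.
    print(rec["patentID"], rec["object"], rec["viewpoint"],
          rec["figure_label"], rec["bbox"])
```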
Semantic Information Extraction: The dataset yields 132,890 unique object names and 22,394 viewpoints. Analysis reveals a diverse but highly imbalanced distribution of viewpoints, posing challenges for 3D reconstruction from 2D sketches.
DEEPPATENT2 is a comprehensive resource poised to propel advancements in diverse research areas, including 3D image reconstruction and image retrieval for technical drawings.
Technical validation of DEEPPATENT2
The data, generated through advanced machine learning and deep learning methods, undergoes a meticulous validation process addressing potential errors in figure label detection, compound image segmentation, label association, and entity recognition (ER). The overall error rate, averaging 7.5 percent, is approximated from mismatches in label association and is expressed as a precision value.
While all figures are retained in the dataset, those with mismatches are flagged in their filenames for reference. The estimated error rates were verified by manually inspecting 1,400 compound figures, which confirmed their consistency. These rates are comparable to those of other computer vision datasets, reflecting the inherent challenges of automated tagging.
To demonstrate the dataset's utility, the researchers showcase a conceptual captioning task, employing a variant of the residual network (ResNet-152) to caption technical drawings. The dataset's potential extends to tasks such as technical drawing image retrieval, summarization of scholarly and technical corpora, 3D image reconstruction, figure segmentation, and technical drawing classification.
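A minimal sketch of such a captioning baseline appears below: a ResNet-152 encoder whose global image feature seeds a single-layer LSTM decoder. The vocabulary size, dimensions, and decoder design are illustrative assumptions rather than the paper's exact model, and weights are randomly initialized here where pretrained weights would be used in practice.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionBaseline(nn.Module):
    """ResNet-152 image encoder feeding a single-layer LSTM decoder."""
    def __init__(self, vocab_size: int = 10000, embed: int = 256, hidden: int = 512):
        super().__init__()
        resnet = models.resnet152(weights=None)  # pretrained weights would be loaded in practice
        self.encoder = nn.Sequential(*list(resnet.children())[:-1])  # drop final fc layer
        self.img_proj = nn.Linear(resnet.fc.in_features, embed)
        self.word_embed = nn.Embedding(vocab_size, embed)
        self.lstm = nn.LSTM(embed, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images).flatten(1)    # (B, 2048) global image features
        feats = self.img_proj(feats).unsqueeze(1)  # image acts as the first "token"
        seq = torch.cat([feats, self.word_embed(captions)], dim=1)
        hidden_states, _ = self.lstm(seq)
        return self.out(hidden_states)             # next-token logits per position

model = CaptionBaseline()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
```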
Furthermore, the dataset could contribute to generative and multimodal design models for innovation, for example by combining generative adversarial networks (GANs) and diffusion models. Its wealth of detailed technical drawings makes it valuable for training accurate multimodal generative models.
Conclusion
In summary, researchers introduced the DEEPPATENT2 dataset. Enriched with semantic details such as object names and multiple views, the dataset effectively addresses the shortcomings observed in its predecessors.
Built with a pioneering pipeline that integrates natural language processing and computer vision methods, the dataset demonstrates its utility through improved performance on conceptual captioning. This expansive resource is poised to advance tasks such as 3D image reconstruction and image retrieval for technical drawings.