In a recent publication in the journal Electronics, researchers employed a Vision Transformer (ViT) for fashion image classification and proposed a fashion recommendation system.
Background
Artificial intelligence (AI) is applied to a wide array of problems, including classification, object recognition, and recommendation. Deep learning now leads image classification, surpassing traditional machine learning because it extracts image features automatically, which streamlines processing. To improve image classification, many researchers have employed convolutional neural networks (CNNs), whose successive layers capture progressively richer image features.
As e-commerce and online shopping grow, fashion image classification has expanded rapidly to encompass recognition, retrieval, recommendation, and fashion trend forecasting. Recommendation systems, driven by big data and user history, guide consumers to new products and are vital across sectors, helping modern businesses satisfy individual preferences. Image recommendation, in turn, identifies the items most visually similar to a query image, matching the features of fashion products to consumer preferences.
Numerous researchers have leveraged deep learning models and pre-trained CNN architectures to classify fashion images, while others have explored fashion recommendation systems. For fashion image classification, hierarchical CNNs built on the VGG16 and VGG19 architectures achieved their best results with VGG16. Several studies compared deep learning architectures in depth and refined the training process for the best-performing architecture. For fashion recommendation, several works extracted features from pre-trained models such as CNNs and residual networks (ResNet50), and the extracted features were ranked using k-nearest neighbors (k-NN) to produce style recommendations.
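The feature-extraction-and-ranking approach used in these works can be illustrated with a minimal sketch, assuming a pre-trained ResNet50 from torchvision as the backbone and scikit-learn's NearestNeighbors for ranking; the image file names and neighbor count below are placeholders rather than details from the cited studies.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.neighbors import NearestNeighbors
from PIL import Image

# Pre-trained ResNet50 with the classification head removed, used as a feature extractor.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
resnet.fc = torch.nn.Identity()   # forward pass now returns a 2048-d feature vector
resnet.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(paths):
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    return resnet(batch).numpy()

# Placeholder catalog of fashion product images.
catalog_paths = ["shirt_001.jpg", "dress_014.jpg", "shoe_027.jpg"]
features = embed(catalog_paths)

# Rank catalog items by Euclidean distance to the query image's features.
knn = NearestNeighbors(n_neighbors=3, metric="euclidean").fit(features)
distances, indices = knn.kneighbors(embed(["query.jpg"]))
print("Recommended items:", [catalog_paths[i] for i in indices[0]])
```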
Materials and methods
In the classification stage, the research employs a variety of models, including pre-trained networks, newly proposed CNN architectures, and a vision transformer (ViT), to classify fashion images. These models are assessed on two publicly available fashion image datasets: Fashion-MNIST and the fashion product dataset.
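As a small illustration, Fashion-MNIST can be loaded through torchvision; the summary does not state which tooling the authors used, so the library, transforms, and batch size below are assumptions.

```python
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Download Fashion-MNIST (28x28 grayscale images, 10 clothing classes).
transform = transforms.ToTensor()
train_set = datasets.FashionMNIST(root="data", train=True, download=True, transform=transform)
test_set = datasets.FashionMNIST(root="data", train=False, download=True, transform=transform)

train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
test_loader = DataLoader(test_set, batch_size=64)
```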
The CNN models developed for the current study, named DeepCNN1, DeepCNN2, and DeepCNN3, combine convolutional, max-pooling, batch-normalization, dropout, and fully connected layers, which together handle feature extraction and classification.
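A minimal sketch of one such network, written in PyTorch and sized for 28x28 grayscale Fashion-MNIST inputs, is shown below; the layer counts, filter sizes, and dropout rates are illustrative assumptions, since the exact DeepCNN1-3 configurations are not reproduced here.

```python
import torch.nn as nn

class DeepCNNSketch(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),              # 28x28 -> 14x14
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),              # 14x14 -> 7x7
            nn.Dropout(0.25),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 128),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```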
The ViT emerges as an innovative alternative to conventional CNNs for image classification. The architecture follows a systematic pipeline: input processing, patch extraction, and embedding, followed by transformer encoder layers and a final feature-processing stage. Key hyperparameters are carefully configured to optimize model performance.
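The same pipeline can be sketched end to end, again assuming 28x28 single-channel inputs and deliberately small hyperparameters (patch size, embedding dimension, encoder depth); the study's actual ViT configuration may differ.

```python
import torch
import torch.nn as nn

class ViTSketch(nn.Module):
    def __init__(self, image_size=28, patch_size=7, in_chans=1,
                 dim=64, depth=4, heads=4, num_classes=10):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patching and linear projection, implemented as a strided convolution.
        self.patch_embed = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                   dim_feedforward=dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        x = self.patch_embed(x)                       # (B, dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)              # (B, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)                           # transformer encoder layers
        return self.head(x[:, 0])                     # classify from the class token
```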
The study extends its focus to fashion recommendation systems, highlighting their role in guiding users toward suitable products. Traditional approaches rely on similarity measures such as cosine similarity and Pearson correlation. The research instead introduces an efficient recommendation system that uses a ViT model (DINOv2) for feature extraction and the FAISS library for rapid nearest-neighbor search. The Gradio library provides a user-friendly interface for interacting with the system, allowing users to upload an image and view its nearest-neighbor images from the dataset.
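A minimal sketch of this pipeline is given below, assuming the ViT-S/14 DINOv2 variant loaded via torch.hub, a flat FAISS index over the catalog embeddings, and a Gradio upload interface; the catalog file names and preprocessing are placeholders rather than the paper's exact setup.

```python
import torch
import faiss
import numpy as np
import gradio as gr
import torchvision.transforms as T
from PIL import Image

# DINOv2 (ViT-S/14) as a frozen feature extractor.
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(image):
    # Returns a (1, 384) float32 feature vector for one PIL image.
    return dinov2(preprocess(image).unsqueeze(0)).numpy().astype("float32")

# Placeholder catalog; in practice these would be the fashion product images.
catalog_paths = ["shirt_001.jpg", "dress_014.jpg", "shoe_027.jpg"]
catalog_images = [Image.open(p).convert("RGB") for p in catalog_paths]

# Flat L2 index over the catalog embeddings.
index = faiss.IndexFlatL2(384)
index.add(np.vstack([embed(img) for img in catalog_images]))

def recommend(image):
    _, idx = index.search(embed(image), 5)            # five nearest neighbors
    return [catalog_images[i] for i in idx[0] if i != -1]

gr.Interface(fn=recommend, inputs=gr.Image(type="pil"),
             outputs=gr.Gallery(label="Nearest neighbors")).launch()
```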
Efficiency is ensured through FAISS's flat L2 index structure, which rapidly computes Euclidean distances for nearest neighbor retrieval. The process includes index creation, distance calculation, nearest neighbor identification, and retrieval. This methodology demonstrates the effectiveness of the DINOv2 model in facilitating efficient and accurate fashion recommendations.
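Stripped of the model and interface code, the index workflow reduces to a few FAISS calls; in this sketch the feature matrix is random placeholder data standing in for the DINOv2 catalog embeddings, and 384 is the feature dimension of the assumed ViT-S/14 variant.

```python
import faiss
import numpy as np

# Placeholder catalog embeddings: 1000 items, 384-d features.
features = np.random.rand(1000, 384).astype("float32")

index = faiss.IndexFlatL2(features.shape[1])   # index creation (exact Euclidean distance)
index.add(features)

query = features[:1]                           # one query embedding
distances, indices = index.search(query, 5)    # distance calculation + nearest-neighbor retrieval
print(indices[0])                              # positions of the five nearest catalog items
```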
Experimental results
The datasets were divided into an 80 percent training set and a 20 percent testing set. To evaluate the proposed models, the researchers employed accuracy, precision, recall, F1-score, the area under the curve (AUC), and ROC curves. On the Fashion-MNIST dataset, the ViT model outperformed the others, with accuracy, precision, recall, and F1-score values of around 95 percent. Conversely, MobileNet exhibited the lowest performance, with none of these metrics exceeding 59 percent.
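A minimal sketch of this evaluation protocol, assuming scikit-learn, placeholder data, and macro-averaged multi-class metrics (an averaging choice not stated in the summary), might look as follows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Placeholder arrays standing in for flattened images and class labels.
X = np.random.rand(1000, 784)
y = np.random.randint(0, 10, size=1000)

# 80 percent training / 20 percent testing split, as described in the study.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

def evaluate(y_true, y_pred, y_score):
    """y_pred: predicted class labels; y_score: per-class probabilities, shape (n, 10)."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall": recall_score(y_true, y_pred, average="macro"),
        "f1": f1_score(y_true, y_pred, average="macro"),
        # one-vs-rest multi-class AUC computed from predicted probabilities
        "auc": roc_auc_score(y_true, y_score, multi_class="ovr"),
    }
```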
By comparison, the CNN models, especially DeepCNN3, outperformed the pre-trained models, with DeepCNN3 achieving high accuracy. The ViT model demonstrated a two- to four-percent performance improvement over the CNN models and consistently achieved higher AUC values than DeepCNN3.
On the fashion product dataset, ViT continued to excel, while MobileNet again displayed the lowest results. As with Fashion-MNIST, the CNN models, particularly DeepCNN3, outperformed the pre-trained models. Here, the ViT model delivered a one- to two-percent performance improvement over the CNN models.
Conclusion
In summary, the researchers employed ViT to improve fashion image classification on public datasets, comparing its performance with CNN and pre-trained models. The results indicated that ViT outperformed the other models, achieving high accuracy on the Fashion-MNIST dataset and even higher accuracy on the fashion product dataset. The study also developed a fashion recommendation system that uses DINOv2 and FAISS for top-five image retrieval. Future work involves user studies, context-aware recommendations, hybrid approaches, explainability, and complexity analysis.