In an article recently published in the journal Scientific Reports, researchers proposed the novel T-Max-Avg pooling layer for convolutional neural networks (CNNs) to address the limitations of traditional pooling methods.
Background
CNNs are composed of convolutional layers, pooling layers, activation layers, and batch normalization layers and are designed to extract features from raw data, particularly in the field of image analysis. The pooling layer is one of the crucial components of a CNN and is primarily applied after a convolutional layer to reduce the parameter count and data volume and to decrease the spatial dimensions of the feature maps.
This layer functions by partitioning each feature map into fixed-size regions and, within every region, either calculating the average value (average pooling) or selecting the maximum value (max pooling). Average pooling and max pooling are the two primary types of pooling operations.
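As a simple illustration of these two standard operations, the minimal PyTorch snippet below (not taken from the paper) shows how max pooling and average pooling each shrink a 4x4 feature map to 2x2 using non-overlapping 2x2 regions.

```python
import torch
import torch.nn as nn

# A single-channel 4x4 feature map (batch size 1), used purely for illustration.
feature_map = torch.tensor([[[[1., 3., 2., 0.],
                              [5., 6., 1., 2.],
                              [0., 2., 4., 4.],
                              [1., 1., 3., 8.]]]])

max_pool = nn.MaxPool2d(kernel_size=2)  # keeps the maximum of each 2x2 region
avg_pool = nn.AvgPool2d(kernel_size=2)  # keeps the mean of each 2x2 region

print(max_pool(feature_map))  # tensor([[[[6., 2.], [2., 8.]]]])
print(avg_pool(feature_map))  # tensor([[[[3.75, 1.25], [1.00, 4.75]]]])
```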
However, these standard pooling operations are not effective for every data type and application, which has motivated the development of custom pooling layers capable of adaptively learning and extracting relevant features from specific datasets.
The proposed approach
In this study, researchers proposed a novel approach for designing and implementing customizable pooling layers to improve the feature extraction capabilities of CNNs. The proposed T-Max-Avg pooling layer incorporated a threshold parameter T which, together with the selection of the K highest-valued pixels from the input data, controls whether the output features are based on weighted averages or on maximum values.
Thus, the custom pooling layer can effectively capture and represent discriminative information in the input data by learning the optimal pooling strategy during training, which improves classification performance. The objective of the study was to develop a new pooling method that overcomes the limitations of conventional pooling functions, specifically by preventing the loss of highly representative values that conventional methods can discard and by ensuring that these values are appropriately represented.
To realize this goal, researchers selected the K pixels with the highest representational capacity and incorporated a learnable parameter T to decide, based on the salient feature information, between the average and the maximum of these pixels. This strategy can mitigate the drawbacks of average pooling and max pooling while also suppressing noise.
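The article does not spell out the exact pooling rule, so the PyTorch sketch below reflects one plausible reading rather than the authors' implementation: within each pooling window, the K largest values are kept, and the threshold T decides whether the window's output is the maximum of those values or their average (a plain average stands in here for the paper's weighted average). The class name TMaxAvgPool2d and the specific decision rule are illustrative assumptions.

```python
import torch
import torch.nn as nn


class TMaxAvgPool2d(nn.Module):
    """Illustrative sketch of a T-Max-Avg-style pooling layer (not the authors' code).

    Within each pooling window, the K largest values are kept; if the window
    maximum exceeds the threshold T, the maximum is returned, otherwise the
    average of the K largest values is returned.
    """

    def __init__(self, pool_size: int, k: int, t: float):
        super().__init__()
        self.pool_size = pool_size
        self.k = k
        self.t = t

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        p = self.pool_size
        # Extract non-overlapping p x p windows: shape (N, C, H/p, W/p, p*p).
        windows = x.unfold(2, p, p).unfold(3, p, p).reshape(n, c, h // p, w // p, p * p)
        topk = windows.topk(self.k, dim=-1).values  # K largest values per window, sorted descending
        max_val = topk[..., 0]                      # window maximum
        avg_topk = topk.mean(dim=-1)                # average of the K largest values
        return torch.where(max_val > self.t, max_val, avg_topk)
```

With inputs normalized to [0, 1], a threshold such as T = 0.7 then acts as a switch between max-like and average-like behavior within each window; as the article suggests, T could also be treated as a parameter learned during training.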
Evaluation of the approach
Three benchmark datasets, namely MNIST, CIFAR-100, and CIFAR-10, and the LeNet-5 CNN model were used in experiments comparing the proposed T-Max-Avg pooling method with the Avg-TopK, maximum, and average pooling methods.
Researchers selected the LeNet-5 model due to its robust classification capabilities and simple structure. The LeNet-5 network was composed of seven layers: three convolutional layers, two pooling layers, and two fully connected layers. The CIFAR-10 dataset, which contains 60,000 color images categorized into 10 distinct classes (trucks, frogs, ships, horses, dogs, cats, deer, birds, airplanes, and cars), is used extensively for computer vision (CV) tasks.
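For orientation, a LeNet-5-style network matching that seven-layer description could be sketched in PyTorch as follows. The filter counts and activations follow the classic design and are illustrative assumptions; in the reported experiments, the two pooling layers (built-in average pooling here) would be swapped for whichever pooling method is under comparison.

```python
import torch.nn as nn


class LeNet5(nn.Module):
    """LeNet-5-style network for 32x32 inputs such as CIFAR-10 (illustrative sketch)."""

    def __init__(self, in_channels: int = 3, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 6, kernel_size=5), nn.Tanh(),  # 32x32 -> 28x28 (conv 1)
            nn.AvgPool2d(kernel_size=2),                          # 28x28 -> 14x14 (pooling 1)
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),           # 14x14 -> 10x10 (conv 2)
            nn.AvgPool2d(kernel_size=2),                          # 10x10 -> 5x5   (pooling 2)
            nn.Conv2d(16, 120, kernel_size=5), nn.Tanh(),         # 5x5   -> 1x1   (conv 3)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(120, 84), nn.Tanh(),   # fully connected 1
            nn.Linear(84, num_classes),      # fully connected 2
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```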
Additionally, the CIFAR-100 dataset is an extended version of CIFAR-10 that is also used for CV tasks. It consists of 100 fine-grained categories covering various everyday items, animals, and objects, with 600 images per category, for a total of 60,000 images.
The MNIST dataset is a handwritten digit recognition dataset utilized extensively in machine learning (ML) and CV research and education. This dataset contains samples representing digits from 0 to 9, with every sample being a 28x28 pixel grayscale image.
Moreover, the effectiveness of the T-Max-Avg pooling technique in transfer learning models, including ChestX, ResNet50, and VGG19, was investigated through a series of experiments using the CIFAR-10 dataset. Researchers used Google Colab for the experiments with the LeNet-5 model and a device with an NVIDIA GeForce GTX 1050 GPU for the extended experiments.
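The article does not detail how the custom layer was inserted into these transfer learning models. One plausible way to try this in PyTorch, shown purely as an assumption rather than the authors' procedure, is to replace the max pooling layers of a pretrained torchvision VGG19 with the sketched TMaxAvgPool2d from above and re-head the classifier for CIFAR-10 before fine-tuning.

```python
import torch.nn as nn
from torchvision import models

# Hypothetical substitution (requires a recent torchvision for the weights enum):
# swap VGG19's max pooling layers for the illustrative TMaxAvgPool2d defined earlier.
model = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)

for name, module in model.features.named_children():
    if isinstance(module, nn.MaxPool2d):
        setattr(model.features, name, TMaxAvgPool2d(pool_size=2, k=2, t=0.7))

# Re-head the final classifier layer for CIFAR-10's ten classes.
model.classifier[-1] = nn.Linear(model.classifier[-1].in_features, 10)
```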
Significance of the study
Experiments performed using the LeNet-5 CNN model on the three datasets demonstrated that the max pooling method was more effective than the average pooling method. However, the proposed T-Max-Avg method outperformed both average pooling and max pooling and improved upon the accuracy of the Avg-TopK pooling method.
The T-Max-Avg pooling method achieved the highest accuracy among all pooling methods on the CIFAR-10, CIFAR-100, and MNIST datasets. In the proposed method, a T value of 0.7, a K value of six, and a pool size of four yielded the highest score for color images, while a T value of 0.8, a K value of three, and a pool size of three yielded the highest score for grayscale images.
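Expressed with the illustrative TMaxAvgPool2d sketch from earlier (a hypothetical instantiation, not the authors' code), those reported settings correspond to configurations like the following.

```python
# Reported best settings, expressed with the illustrative TMaxAvgPool2d sketch above.
color_pool = TMaxAvgPool2d(pool_size=4, k=6, t=0.7)      # color images (CIFAR-10, CIFAR-100)
grayscale_pool = TMaxAvgPool2d(pool_size=3, k=3, t=0.8)  # grayscale images (MNIST)
```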
These results indicated that the T-Max-Avg method can capture feature information more accurately and provide better results during model training. Moreover, when applied to the ChestX and ResNet50 transfer learning models, the T-Max-Avg method attained better results than conventional pooling methods.
To summarize, the findings of this study demonstrated that the proposed method could effectively address the limitations of conventional pooling methods and offer an alternative avenue for the expansion and development of existing methods in the field.