In an article recently submitted to the arXiv* preprint server, researchers proposed a novel pretraining framework, the Scale-Aware Masked Autoencoder (Scale-MAE), and investigated its feasibility for remote sensing imagery.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
Remote sensing data are primarily captured from planes and satellites using a combination of viewing geometries, sensors, and processing pipelines. The ground sample distance (GSD) of an image can vary from 0.3 m to 1 km, depending on the sensor's geometry relative to the Earth and its composition.
Thus, the data, and the points of interest and objects within each image, can span a wide range of spatial scales. Data obtained from these multiscale sensors provide complementary and critical information for research and operational applications in environmental, agricultural, hydrologic, and atmospheric monitoring.
Only a few modern computer vision (CV) methods explicitly address multiscale remote sensing imagery. Meanwhile, the remote-sensing vision community increasingly relies on large pre-trained models, which are typically fine-tuned for a single data source at a specific scale.
The Scale-MAE model
In this study, researchers proposed Scale-MAE, a masked reconstruction model that explicitly learns relationships between data at known, different scales throughout the pretraining process and leverages this information to produce a pre-trained model that performs well across different tasks and GSDs.
MAEs offer self-supervised learning without any explicit augmentation. A standard MAE crops and resizes an image, masks the majority of the transformed image, and then uses a Vision Transformer (ViT)-based autoencoder to embed the unmasked patches. A decoding ViT then reconstructs the entire image from these learned embeddings. The decoder is eventually discarded, and the encoder is used to generate representations for unmasked input images.
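For readers unfamiliar with the recipe, the standard MAE pipeline described above can be sketched in a few dozen lines of PyTorch. The example below is a minimal illustration only: the module sizes, the mask ratio, and the omission of token un-shuffling and decoder positional embeddings are simplifying assumptions, not the authors' implementation.

```python
# Minimal sketch of the standard MAE recipe (illustrative, not the paper's code):
# patch embedding, random masking, a small ViT-style encoder/decoder, and per-patch
# pixel predictions. Only masked patches would contribute to the reconstruction loss.
import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=192, mask_ratio=0.75):
        super().__init__()
        self.patch, self.mask_ratio = patch, mask_ratio
        self.num_patches = (img_size // patch) ** 2
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, self.num_patches, dim))
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        dec_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, patch * patch * 3)  # reconstruct raw pixels per patch

    def forward(self, imgs):
        B = imgs.shape[0]
        tokens = self.embed(imgs).flatten(2).transpose(1, 2) + self.pos  # (B, N, dim)
        # Randomly keep ~25% of the patches; the remaining ~75% are masked out.
        keep = int(self.num_patches * (1 - self.mask_ratio))
        idx = torch.rand(B, self.num_patches, device=imgs.device).argsort(dim=1)
        keep_idx, mask_idx = idx[:, :keep], idx[:, keep:]
        visible = torch.gather(tokens, 1,
                               keep_idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))
        latent = self.encoder(visible)
        # Decoder sees encoded visible tokens plus learned mask tokens for the rest.
        # NOTE: for brevity this sketch skips un-shuffling tokens back to their original
        # order and the decoder's own positional embedding, which the full recipe uses.
        mask_tok = self.mask_token.expand(B, mask_idx.shape[1], -1)
        full = torch.cat([latent, mask_tok], dim=1)
        recon = self.head(self.decoder(full))
        return recon  # the loss would compare predictions for masked slots to true pixels

model = TinyMAE()
out = model(torch.randn(2, 3, 224, 224))
print(out.shape)  # torch.Size([2, 196, 768]) -> per-patch pixel predictions
```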
Scale-MAE is an MAE-based self-supervised pretraining framework that makes two significant modifications to the existing MAE framework. Standard MAE-based methods use relative or absolute positional encodings to inform the ViT of the positions of the unmasked patches, so an image at resolution r receives the same positional encodings irrespective of the image content.
Thus, existing MAE-based pretraining approaches cannot generalize across domains whose images appear at different scales. Scale-MAE instead introduces a GSD-based positional encoding that scales in proportion to the area of land covered by the image, irrespective of the image resolution, informing the ViT of both the position and the scale of the input image.
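The idea behind a GSD-based positional encoding can be illustrated with a short sketch: patch positions are scaled by the ratio of the image GSD to a reference GSD before the usual sinusoidal encoding is applied, so two images with identical pixel dimensions but different ground footprints receive different encodings. The reference GSD, embedding size, and function name below are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of a GSD-scaled sine/cosine positional encoding in the spirit of Scale-MAE.
# Positions are stretched by gsd / reference_gsd, so images covering more ground get
# encodings spread over a proportionally larger range. All constants are assumptions.
import torch

def gsd_positional_encoding(num_patches_side, dim, gsd, reference_gsd=1.0):
    """1D sinusoidal encoding per axis, with positions scaled by gsd / reference_gsd."""
    pos = torch.arange(num_patches_side, dtype=torch.float32)
    pos = pos * (gsd / reference_gsd)              # scale positions by ground footprint
    i = torch.arange(0, dim, 2, dtype=torch.float32)
    freqs = pos[:, None] / (10000 ** (i[None, :] / dim))
    enc = torch.zeros(num_patches_side, dim)
    enc[:, 0::2] = torch.sin(freqs)
    enc[:, 1::2] = torch.cos(freqs)
    return enc

# Two images of the same pixel size but different GSDs get different encodings,
# which is exactly the property a resolution-only positional encoding lacks.
enc_03m = gsd_positional_encoding(14, 64, gsd=0.3)
enc_10m = gsd_positional_encoding(14, 64, gsd=10.0)
print(torch.allclose(enc_03m, enc_10m))  # False
```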
Additionally, Scale-MAE introduces a Laplacian-pyramid decoder to the MAE framework so that the network learns multiscale representations. The ViT encoder embeddings are decoded into a lower-resolution image that captures low-frequency information and a higher-resolution image that captures residual high-frequency information.
In this study, Scale-MAE pre-trains a network by masking an input image at a known input scale, where the area of the Earth covered by the image, rather than the image resolution, determines the scale of the ViT positional encoding. Scale-MAE encodes the masked image with a standard ViT backbone and then decodes it through a bandpass filter to reconstruct low-frequency and high-frequency images at lower and higher scales, respectively.
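The Laplacian-pyramid-style reconstruction can likewise be sketched as two branches: one predicts a low-resolution, low-frequency image and the other a higher-resolution, high-frequency residual, with the final reconstruction formed by upsampling the former and adding the latter. The module below is a toy illustration under assumed channel and scale choices, not the authors' decoder architecture.

```python
# Toy sketch of Laplacian-pyramid-style decoding: a low-frequency branch and a
# high-frequency residual branch, combined by upsampling and addition. The grid,
# resolution, and channel choices are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LaplacianDecoderSketch(nn.Module):
    def __init__(self, dim=192, grid=14, low_res=112, high_res=448):
        super().__init__()
        self.grid, self.low_res, self.high_res = grid, low_res, high_res
        self.to_low = nn.Linear(dim, (low_res // grid) ** 2 * 3)    # low-frequency image
        self.to_high = nn.Linear(dim, (high_res // grid) ** 2 * 3)  # high-frequency residual

    def _unpatchify(self, x, res):
        B, p = x.shape[0], res // self.grid
        x = x.view(B, self.grid, self.grid, p, p, 3)
        return x.permute(0, 5, 1, 3, 2, 4).reshape(B, 3, res, res)

    def forward(self, tokens):                      # tokens: (B, grid*grid, dim)
        low = self._unpatchify(self.to_low(tokens), self.low_res)
        high = self._unpatchify(self.to_high(tokens), self.high_res)
        # Final high-resolution reconstruction: upsampled low-frequency image + residual.
        recon = F.interpolate(low, size=self.high_res, mode="bilinear",
                              align_corners=False) + high
        return low, high, recon

dec = LaplacianDecoderSketch()
low, high, recon = dec(torch.randn(2, 14 * 14, 192))
print(low.shape, high.shape, recon.shape)  # (2,3,112,112) (2,3,448,448) (2,3,448,448)
```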
Experimental evaluation
Researchers investigated the quality of the representations produced by Scale-MAE pretraining through a set of experiments that assessed both the robustness of the representations to scale and their transfer performance on additional tasks.
They evaluated the quality of the Scale-MAE representations by freezing the encoder and performing nonparametric k-nearest-neighbor (kNN) classification on eight remote sensing imagery classification datasets with various GSDs not encountered during pretraining. The performance of Scale-MAE was then compared with ConvMAE, a state-of-the-art multiscale MAE, and SatMAE, a current state-of-the-art MAE for remote sensing imagery.
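A frozen-encoder kNN evaluation of this kind follows a simple pattern: extract features for the train and test splits with the frozen encoder, fit a nonparametric kNN classifier on the training features, and report test accuracy. The sketch below uses a dummy encoder, random data, and an assumed value of k purely to show the call pattern; it is not the authors' evaluation code.

```python
# Sketch of frozen-encoder kNN evaluation: features from a frozen encoder are fed to
# a nonparametric k-nearest-neighbor classifier. Encoder, data, and k are placeholders.
import numpy as np
import torch
from sklearn.neighbors import KNeighborsClassifier

@torch.no_grad()
def extract_features(encoder, loader, device="cpu"):
    encoder.eval()
    feats, labels = [], []
    for images, targets in loader:
        out = encoder(images.to(device))           # (B, feature_dim) pooled representation
        feats.append(out.cpu().numpy())
        labels.append(targets.numpy())
    return np.concatenate(feats), np.concatenate(labels)

def knn_eval(encoder, train_loader, test_loader, k=20):
    train_x, train_y = extract_features(encoder, train_loader)
    test_x, test_y = extract_features(encoder, test_loader)
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(train_x, train_y)
    return knn.score(test_x, test_y)               # classification accuracy

# Tiny synthetic demo of the call pattern (random data, dummy encoder).
dummy_encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 64))
make_loader = lambda n: torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(n, 3, 32, 32), torch.randint(0, 5, (n,))),
    batch_size=16)
print(knn_eval(dummy_encoder, make_loader(80), make_loader(40), k=5))
```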
Researchers used the SpaceNetv1 building segmentation dataset to evaluate semantic segmentation results for MAE-based and contrastive pretraining methods, including Scale-MAE, ConvMAE, SatMAE, vanilla MAE, GASSL, and a supervised model trained from scratch (Sup. (Scratch)), using the PSANet and UperNet segmentation architectures.
Significance of the study
Scale-MAE outperformed SatMAE and ConvMAE across all evaluation datasets and ranges of GSDs except UC Merced, with average nonparametric kNN classification improvements of 5.6% and 2.4%, respectively. Additionally, Scale-MAE outperformed both methods by a larger margin as the evaluation GSD deviated further from the original GSD, indicating that Scale-MAE learned representations that are more resilient to changes in scale in remote sensing imagery.
UC Merced at 100% of the true GSD was the only evaluation in which SatMAE outperformed Scale-MAE. Moreover, Scale-MAE achieved an improvement of 0.9 to 1.7 mIoU on the SpaceNet building segmentation transfer task across a range of evaluation scales.
Journal reference:
- Preliminary scientific report.
Reed, C. J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Keutzer, K., Candido, S., Uyttendaele, M., Darrell, T. (2022). Scale-MAE: A Scale-Aware Masked Autoencoder for Multiscale Geospatial Representation Learning. arXiv. https://doi.org/10.48550/arXiv.2212.14532, https://arxiv.org/abs/2212.14532