Scale-MAE: A Novel Pretraining Framework for Improved Remote Sensing Imagery Analysis

In an article recently submitted to the arXiv* preprint server, researchers proposed a novel pretraining framework, designated the Scale-Aware Masked Autoencoder (Scale-MAE), and investigated its feasibility for remote sensing imagery.


*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.

Background

Remote sensing data is primarily captured from planes and satellites using a combination of viewing geometries, sensors, and processing pipelines. The ground sample distance (GSD) of every image can vary from 0.3 m to 1 km, depending on the sensor's geometry relative to the Earth and its composition.

Thus, the data and the corresponding points of interest and objects within every image can differ across wide spatial ranges. Data obtained from these multiscale sensors provide complementary and critical information for different research and operational applications in environmental, agricultural, hydrologic, and atmospheric monitoring.

Multiscale remote sensing imagery has been explicitly addressed by only a few modern computer vision (CV) methods. However, large pre-trained models are increasingly being used by the remote-sensing vision community, and these models are typically fine-tuned for a single data source at a specific scale.

The Scale-MAE model

In this study, researchers proposed a masked reconstruction model, Scale-MAE, which explicitly learns relationships between data at known, different scales throughout the pretraining process and leverages this information to produce a pre-trained model that performs efficiently across different tasks and GSDs.

MAEs offer self-supervised learning without any explicit augmentation. A standard MAE crops/resizes an image, masks most of the transformed image, and then uses a Vision Transformer (ViT)-based encoder to embed the unmasked patches. Subsequently, a decoding ViT reconstructs the entire image from these learned embeddings. Eventually, the decoder is discarded, and the encoder is used to generate representations for an unmasked input image.
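The following minimal PyTorch sketch illustrates this standard MAE flow; the module names (patch_embed, encoder, decoder), their interfaces, and the 75% mask ratio are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of the standard MAE pretraining flow described above.
# Module names and signatures are illustrative, not the authors' code.
import torch

def mae_forward(images, patch_embed, encoder, decoder, mask_ratio=0.75):
    # 1. Split the image into patch tokens.
    tokens = patch_embed(images)                      # (B, N, D)
    B, N, D = tokens.shape

    # 2. Randomly mask most patches; keep only the visible ones.
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)
    keep_idx = noise.argsort(dim=1)[:, :n_keep]       # indices of visible patches
    visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))

    # 3. Encode only the visible patches with the ViT encoder.
    latent = encoder(visible)

    # 4. Decode the full set of patches (mask tokens are inserted inside the
    #    decoder) to reconstruct the image; after pretraining the decoder is
    #    discarded and only the encoder is kept.
    reconstruction = decoder(latent, keep_idx, N)
    return reconstruction
```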

Scale-MAE is primarily an MAE-based self-supervised pretraining framework that makes two significant contributions to the existing MAE framework. Standard MAE-based methods use relative or absolute positional encodings to inform the ViT of the positions of the unmasked patches, so an image at a given resolution r has the same positional encodings irrespective of its content.

Thus, existing MAE-based pretraining approaches cannot generalize across domains with images at different scales. Scale-MAE instead introduces a GSD-based positional encoding that scales in proportion to the area of land covered by the image, irrespective of image resolution, informing the ViT of both the position and the scale of the input image.
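As an illustration of this idea, the sketch below scales a standard sinusoidal positional encoding by the ratio of the image's GSD to a reference GSD, so that the encoding tracks ground coverage rather than the pixel grid; the exact formulation and reference GSD used in the paper may differ.

```python
import torch

def gsd_positional_encoding(num_positions, dim, gsd, reference_gsd=1.0):
    """Sinusoidal positional encoding scaled by ground sample distance.

    Illustrative sketch: positions are stretched by gsd / reference_gsd so
    that two images covering the same ground area receive comparable
    encodings regardless of their pixel resolution.
    """
    assert dim % 2 == 0, "embedding dimension assumed even for this sketch"
    pos = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)
    pos = pos * (gsd / reference_gsd)            # scale by ground coverage
    i = torch.arange(0, dim, 2, dtype=torch.float32)
    div = torch.pow(10000.0, i / dim)
    pe = torch.zeros(num_positions, dim)
    pe[:, 0::2] = torch.sin(pos / div)
    pe[:, 1::2] = torch.cos(pos / div)
    return pe
```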

Additionally, Scale-MAE introduces a Laplacian-pyramid decoder into the MAE framework to enable the network to learn multiscale representations. The ViT encoder embeddings are decoded into a lower-resolution image that captures low-frequency information and a higher-resolution image that captures residual high-frequency information.

In this study, Scale-MAE was used to pre-train a network by masking an input image at a known input scale, with the area of the Earth covered by the image, rather than the image resolution, determining the scale of the ViT positional encoding. Scale-MAE encoded the masked image using a standard ViT backbone and then decoded it through a bandpass-filter structure to reconstruct low- and high-frequency images at lower and higher scales, respectively.
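The sketch below shows one way such low-/high-frequency reconstruction targets could be formed from an input image; the specific scale factors and interpolation choices are assumptions for illustration, not the paper's exact decoder design.

```python
import torch
import torch.nn.functional as F

def laplacian_targets(image, low_scale=0.25, high_scale=2.0):
    """Build low-/high-frequency targets in the spirit of the Laplacian-pyramid
    decoding described above (illustrative; scale factors are assumptions).

    The low target is a heavily downsampled image (low-frequency content);
    the high target is the residual left after removing that low-frequency
    content from an upsampled image (high-frequency detail).
    """
    low = F.interpolate(image, scale_factor=low_scale, mode="bilinear",
                        align_corners=False)
    high_res = F.interpolate(image, scale_factor=high_scale, mode="bilinear",
                             align_corners=False)
    low_upsampled = F.interpolate(low, size=high_res.shape[-2:],
                                  mode="bilinear", align_corners=False)
    high_residual = high_res - low_upsampled   # band-passed high-frequency target
    return low, high_residual
```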

Experimental evaluation

Researchers investigated the quality of representations from Scale-MAE pretraining by performing a set of experiments. These experiments assessed the robustness of the representations to scale and how well the representations transfer to additional tasks.

They evaluated the Scale-MAE representation quality by freezing the encoder and conducting a nonparametric k-nearest-neighbor (kNN) classification using eight remote sensing imagery classification datasets with various GSDs not encountered during pretraining. Subsequently, the performance of Scale-MAE was compared with ConvMAE, a state-of-the-art multiscale MAE, and SatMAE, a current state-of-the-art MAE for remote sensing imagery.
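A simplified version of such a frozen-encoder kNN evaluation might look as follows; the encoder and data-loader interfaces and the value of k are assumptions for illustration, not the protocol's exact settings.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def knn_eval(encoder, train_loader, test_loader, k=20):
    """Nonparametric kNN classification on frozen encoder features (sketch)."""
    encoder.eval()

    def extract(loader):
        feats, labels = [], []
        for images, targets in loader:
            f = encoder(images)                  # (B, D) pooled features, assumed
            feats.append(F.normalize(f, dim=1))
            labels.append(targets)
        return torch.cat(feats), torch.cat(labels)

    train_f, train_y = extract(train_loader)
    test_f, test_y = extract(test_loader)

    # Cosine similarity between test and train features; majority vote over top-k.
    sims = test_f @ train_f.T                    # (N_test, N_train)
    topk = sims.topk(k, dim=1).indices
    neighbor_labels = train_y[topk]              # (N_test, k)
    preds = neighbor_labels.mode(dim=1).values
    return (preds == test_y).float().mean().item()
```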

Researchers used the SpaceNetv1 building segmentation dataset to evaluate semantic segmentation performance for MAE-based and contrastive pretraining methods, including Scale-MAE, ConvMAE, SatMAE, vanilla MAE, a supervised model trained from scratch (Sup. (Scratch)), and GASSL, using the PSANet and UperNet segmentation architectures.

Significance of the study

Scale-MAE outperformed SatMAE and ConvMAE across all evaluation datasets and GSD ranges except UC Merced, with average nonparametric kNN classification improvements of 5.6% and 2.4%, respectively. Additionally, Scale-MAE outperformed both methods by an increasingly large margin as the evaluation GSD deviated further from the original GSD, indicating that Scale-MAE learned representations that are more resilient to changes in scale in remote sensing imagery.

UC Merced at 100% of the true GSD was the only evaluation in which SatMAE outperformed Scale-MAE. Moreover, Scale-MAE achieved a 0.9 to 1.7 mIoU improvement on the SpaceNet building segmentation transfer task across a range of evaluation scales.



Written by

Samudrapom Dam

Samudrapom Dam is a freelance scientific and business writer based in Kolkata, India. He has been writing articles related to business and scientific topics for more than one and a half years. He has extensive experience in writing about advanced technologies, information technology, machinery, metals and metal products, clean technologies, finance and banking, automotive, household products, and the aerospace industry. He is passionate about the latest developments in advanced technologies, the ways these developments can be implemented in a real-world situation, and how these developments can positively impact common people.

