In a paper published in the journal Scientific Data, researchers introduced a methodology built on cloud computing platforms such as Google Earth Engine. They established a global database of nearly two million training units spanning seven primary and nine secondary land cover classes from 1984 to 2020.
Using machine learning (ML) algorithms and Landsat imagery, they ensured data quality and representation across diverse biogeographic regions. The resource supports studies ranging from land cover change to urban development.
Background
The accuracy of remote sensing-derived land cover maps hinges on robust training data, balancing size and quality across various classification algorithms. Existing global datasets are limited by coverage, resolution, or temporal scope, prompting the Global Land Cover Estimation (GLanCE) project. GLanCE aims to create a high-resolution, comprehensive training database spanning four decades, leveraging cloud computing and ML for accuracy and ecological representation.
GLanCE Training Data Collection Overview
The GLanCE project collected training data with trained image analysts from Boston University. Analysts used several tools, including the Google Earth Engine API and high-resolution imagery, to interpret land cover characteristics through a detailed process. Each entry in the database represents an individual Landsat pixel, termed a training unit.
A stringent quality assessment combined expert review with an ML-based cross-validation process to remove training units whose labels were inconsistent. The team comprised six to twelve trained analysts who underwent consistent training, periodic refresher courses, and weekly meetings for quality enhancement. This approach resulted in continent-specific databases with up to 23 land cover attributes per unit.
GLanCE collected training data from multiple sources, including the System for Terrestrial Ecosystem Parameterization (STEP) database, Landsat spectral-temporal features via unsupervised clustering, and feedback training units to rectify classification errors. These sources delivered homogeneous and heterogeneous land cover representations, emphasizing thematic detail and completeness.
Additionally, GLanCE supplemented its dataset by harmonizing and standardizing publicly available, collaborator-contributed, and team-collected datasets, ensuring alignment with the GLanCE land cover classification key. Despite these efforts, some land cover classes remained underrepresented, prompting the team to augment the dataset with data from sources such as the World Settlement Footprint and Global Surface Water products. Candidate training units for this augmentation were selected algorithmically.
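The article does not describe the selection algorithms themselves, so the following is only a minimal sketch of the step that precedes them: counting labels per class and working out how many extra units each underrepresented class would need. The function name, the class list, and the `min_count` threshold are all hypothetical.

```python
from collections import Counter

def class_gaps(labels, classes, min_count):
    """Return {class: extra units needed} for classes below min_count."""
    counts = Counter(labels)
    return {
        c: min_count - counts[c]   # Counter returns 0 for unseen classes
        for c in classes
        if counts[c] < min_count
    }

# Illustrative label list, not real GLanCE counts.
labels = ["Trees"] * 60 + ["Herbaceous"] * 30 + ["Developed"] * 6 + ["Water"] * 4
classes = ["Trees", "Herbaceous", "Developed", "Water", "Ice/Snow"]

gaps = class_gaps(labels, classes, min_count=10)
```

A gap table like this would tell the team how many candidates to draw from an external product (for example, settlement pixels for a developed class) before those candidates go through the usual quality checks.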
Pre-processing and harmonizing supplementary datasets involved several steps, including interpreter confidence filtering, legend harmonization, comparison with existing land cover products, and visual inspection to ensure quality. Although the team strove to align these datasets with GLanCE standards, constraints remained because it had little control over the accuracy and consistency of external datasets. Even so, the iterative approach allowed continual refinement of the training dataset's quality and relevance for global land cover mapping.
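The first two steps, interpreter confidence filtering and legend harmonization, can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual crosswalk: the external class names, the mapping, the target class names, the confidence scale, and the threshold are all hypothetical.

```python
# Hypothetical crosswalk from an external legend to Level One style classes.
# Both the external names and the target names are illustrative assumptions.
CROSSWALK = {
    "open_water": "Water",
    "urban": "Developed",
    "evergreen_forest": "Trees",
    "grassland": "Herbaceous",
    "shrubland": "Shrub",
}

def harmonize(records, min_confidence=2):
    """Keep confident records whose class maps onto the target legend."""
    kept = []
    for rec in records:
        if rec["confidence"] < min_confidence:
            continue  # interpreter confidence filtering
        target = CROSSWALK.get(rec["label"])
        if target is None:
            continue  # no unambiguous match in the target legend
        kept.append({"lat": rec["lat"], "lon": rec["lon"], "level1": target})
    return kept

# Made-up external records for demonstration only.
external = [
    {"lat": 42.35, "lon": -71.10, "label": "urban", "confidence": 3},
    {"lat": -3.10, "lon": -60.02, "label": "evergreen_forest", "confidence": 1},
    {"lat": 51.50, "lon": 0.12, "label": "mosaic_cropland", "confidence": 3},
    {"lat": 64.10, "lon": -21.90, "label": "grassland", "confidence": 2},
]
harmonized = harmonize(external)
```

Dropping classes that have no clean counterpart, rather than forcing them into the nearest category, trades sample size for label consistency; the later steps (comparison with existing products and visual inspection) would then operate on the surviving records.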
Overview of GLanCE Training Dataset
The GLanCE training dataset, available under the Creative Commons license CC-BY-4.0 from Source Cooperative, comprises two hierarchical sets of land cover classes: seven broad (Level One) and nine secondary (Level Two) classes. This classification scheme is designed primarily for land cover and aligns with standard systems like the IPCC and FAO Land Cover Classification.
Based on Landsat pixels from 1984 to 2020, each training unit includes Level One and Level Two labels plus additional attributes such as LC_Confidence and Segment_Type, the latter indicating stable or transitional status. Around 79% of the dataset represents stable land cover, while 21% denotes change, encompassing abrupt and gradual transitions such as forest regrowth or coastal water dynamics. The dataset covers all land cover classes globally, yet only 50% of the units contain Level Two legend information because supplementary datasets lack land use details.
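A training unit with these attributes might be modeled as in the sketch below, which also shows how shares like the 79% and 50% figures would be computed. Only LC_Confidence and Segment_Type are named in the article; the other field names, the attribute values ("Stable", "Transitional"), the Level Two label, and the sample records are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrainingUnit:
    level1: str              # Level One label
    level2: Optional[str]    # Level Two label, absent for many units
    lc_confidence: int       # LC_Confidence (scale assumed here)
    segment_type: str        # Segment_Type (value names assumed here)
    start_year: int          # first year of the time segment
    end_year: int            # last year of the time segment

def summarize(units):
    """Share of stable units and share of units carrying a Level Two label."""
    n = len(units)
    stable = sum(u.segment_type == "Stable" for u in units)
    with_l2 = sum(u.level2 is not None for u in units)
    return {"stable_share": stable / n, "level2_share": with_l2 / n}

# Four made-up units; "Grassland" is a hypothetical Level Two label.
units = [
    TrainingUnit("Trees", None, 3, "Stable", 1999, 2019),
    TrainingUnit("Herbaceous", "Grassland", 2, "Stable", 2001, 2015),
    TrainingUnit("Trees", None, 2, "Transitional", 2005, 2012),
    TrainingUnit("Water", "Water", 3, "Stable", 1984, 2020),
]
summary = summarize(units)
```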
This extensive dataset combines disparate sources into a single global training database. It includes publicly available, collaborator-provided, GLanCE-collected, and Boston University team-collected data. Notably, GLanCE data offers detailed ancillary information and change records over an extended period, often spanning 20 years between 1999 and 2019. However, distribution among continents varies, with Europe and South America exhibiting higher data availability, the latter encompassing time segment lengths of up to 35 years in some cases. Despite these variations, the dataset is the most extensive and detailed publicly accessible global land cover and land use training resource to date.
GLanCE Dataset: Rigorous Validation Analysis
The GLanCE dataset was validated with a two-step ML-based cross-validation approach to ensure data quality. Using climate variables and spatial clustering, the researchers removed approximately 15% of the training data uniformly across continents, targeting misclassification and confusion between classes. The process also compared classification results against reference data inspired by previous studies. Accuracy was high across most continents, but some land cover types remained difficult to distinguish, notably herbaceous vegetation and shrubs. This reflects mixed representations within training units and shows that such classes are hard to separate even at a 30 m spatial resolution.
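The general idea of cross-validation filtering can be illustrated with a small sketch: hold out part of the data, train on the rest, and flag held-out units whose predicted class disagrees with their label. The example below substitutes a simple nearest-centroid classifier and interleaved folds for the authors' two-step procedure, which the article does not detail, and it omits the climate variables and spatial clustering. The two features and all values are made up.

```python
from collections import defaultdict
from math import dist

def centroids(samples):
    """Mean feature vector per class from (features, label) pairs."""
    groups = defaultdict(list)
    for features, label in samples:
        groups[label].append(features)
    return {
        label: tuple(sum(col) / len(rows) for col in zip(*rows))
        for label, rows in groups.items()
    }

def flag_suspects(samples, k=3):
    """Return indices whose held-out prediction disagrees with the label."""
    suspects = []
    for fold in range(k):
        # Train on every unit outside the current fold.
        train = [s for i, s in enumerate(samples) if i % k != fold]
        cents = centroids(train)
        for i, (features, label) in enumerate(samples):
            if i % k != fold:
                continue
            predicted = min(cents, key=lambda c: dist(cents[c], features))
            if predicted != label:
                suspects.append(i)
    return sorted(suspects)

# Toy (greenness, brightness) features; index 6 is deliberately mislabeled.
samples = [
    ((0.80, 0.10), "Trees"), ((0.20, 0.50), "Barren"), ((0.82, 0.12), "Trees"),
    ((0.18, 0.55), "Barren"), ((0.78, 0.09), "Trees"), ((0.22, 0.52), "Barren"),
    ((0.81, 0.11), "Barren"), ((0.19, 0.49), "Barren"), ((0.79, 0.13), "Trees"),
]
suspects = flag_suspects(samples)
```

Flagged units are candidates for review or removal rather than proven errors; spectrally mixed classes such as herbaceous vegetation and shrubs would produce many such disagreements even when labels are correct.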
Conclusion
To sum up, the GLanCE dataset was validated using ML techniques and spatial analysis to ensure data accuracy. Across continents, about 15% of the training data was removed to address misclassification and confusion among land cover classes.
Some challenges remained, particularly in differentiating land cover types such as herbaceous vegetation and shrubs, owing to their mixed representation within training units. These findings underscore how difficult it is to delineate certain land cover categories, even at a spatial resolution of 30 meters. The dataset is a comprehensive resource for global land cover analysis, and it also highlights the need for continued research on these classification challenges.
Article Revisions
- Dec 18 2023 - Fixed error in journal paper hyperlink