In a recent paper submitted to the arXiv* server, researchers introduced TreeFormer, a novel semi-supervised, transformer-based framework for accurately counting trees in aerial and satellite images. This article explores the advances TreeFormer brings and its significance for several fields.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
The role of trees in maintaining ecological balance and planetary health cannot be overstated. Counting trees in high-resolution images has practical applications in forest inventory, farm management, urban planning, and crop estimation. Traditional methods such as field surveys are time-consuming and expensive, whereas aerial and satellite images, along with light detection and ranging (LiDAR) data, can deliver accurate results without surveying every site on the ground.
To overcome the high cost of labeling a large number of trees in supervised methods, researchers introduced TreeFormer, a semi-supervised framework based on transformer architecture. TreeFormer incorporates a pyramid vision transformer for feature extraction and a contextual attention-based feature fusion module.
Additionally, they proposed a pyramid learning strategy that leverages unlabeled data through local tree density consistency and local tree count ranking losses. They also developed a tree counter token for estimating global tree counts. The proposed method outperforms state-of-the-art approaches on benchmark datasets (Jiangsu, Yosemite) and a newly created dataset (KCL-London) with manually annotated tree locations.
Related work
Object counting: In object counting, various methods have been developed for different objects, including humans, cells, cars, and trees. Fully supervised methods achieve high performance but require extensive labeled data. Weakly or semi-supervised methods reduce the reliance on labeled data by incorporating unlabeled or weakly annotated data.
Tree counting: Tree counting poses additional challenges due to dense canopies and interlocking trees. Traditional methods involve detecting tree areas and using segmentation techniques, but their accuracy is limited. Deep neural networks (DNNs) have shown promise in tree detection and counting. Detection-based methods use bounding boxes to identify and count individual trees, while density estimation-based methods generate density maps to estimate tree numbers. These methods leverage DNNs and have demonstrated better performance.
However, limited research has focused on tree density estimation, and existing approaches often rely on basic DNN architectures. The scarcity of annotated training data in tree counting calls for an efficient semi-supervised framework.
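To make the density estimation idea above concrete, the following is a minimal sketch of how a ground-truth density map can be built from point annotations, assuming the common practice of placing one unit-mass Gaussian at each annotated tree. The function names, kernel size, and sigma are illustrative assumptions, not values taken from the TreeFormer paper.

```python
import numpy as np

def gaussian_kernel(size: int = 15, sigma: float = 3.0) -> np.ndarray:
    """Return a size x size Gaussian kernel normalized to sum to 1 (size must be odd)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()

def density_map(points, height, width, size=15, sigma=3.0):
    """Place one unit-mass Gaussian at each annotated (row, col) tree location."""
    pad = size // 2
    # Work on a padded canvas so kernels near the border can be stamped whole.
    canvas = np.zeros((height + 2 * pad, width + 2 * pad))
    kernel = gaussian_kernel(size, sigma)
    for r, c in points:
        # In padded coordinates the kernel centered on (r, c) spans [r, r + size).
        canvas[r:r + size, c:c + size] += kernel
    # Cropping the padding discards any mass that falls outside the image.
    return canvas[pad:-pad, pad:-pad]

# Hypothetical annotations: three trees in a 64 x 64 tile.
trees = [(20, 20), (40, 50), (10, 30)]
dmap = density_map(trees, 64, 64)
estimated_count = dmap.sum()  # integrates to the number of trees away from borders
```

Because each Gaussian integrates to one, summing the map over any region gives the (fractional) number of trees in that region, which is what lets a network regress counts without detecting individual crowns.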
Methodology
The researchers proposed a semi-supervised framework for estimating tree density maps from remote sensing images. The framework consists of an encoder-decoder architecture with transformer blocks. It includes a pyramid tree feature representation (PTFR) module in the encoder, a contextual attention-based feature fusion (CAFF) module in the decoder, a tree density regressor (TDR) module for density map estimation, and a tree counter token (TCT) module for tree counting.
The framework utilizes supervised distribution matching loss for labeled data and introduces local tree density consistency and local tree count ranking losses for unlabeled data. A global tree count regularization is applied to optimize the network's predictions.
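The two unlabeled-data losses can be illustrated with a rough sketch. It assumes a generic form of each idea: consistency compares per-cell counts across density maps predicted at different scales, and ranking applies a hinge penalty when the count predicted for a sub-region exceeds the count predicted for a region that contains it. The function names, the grid size, and the exact loss forms are assumptions for illustration and may differ from the paper's formulation.

```python
import numpy as np

def cell_counts(density: np.ndarray, grid: int) -> np.ndarray:
    """Sum a density map over a grid x grid partition of non-overlapping cells."""
    h, w = density.shape
    bh, bw = h // grid, w // grid
    cells = density[:bh * grid, :bw * grid].reshape(grid, bh, grid, bw)
    return cells.sum(axis=(1, 3))

def consistency_loss(maps, grid: int = 8) -> float:
    """Mean squared deviation of per-cell counts from their average across scales."""
    counts = np.stack([cell_counts(m, grid) for m in maps])
    mean = counts.mean(axis=0, keepdims=True)
    return float(((counts - mean) ** 2).mean())

def ranking_loss(inner_count: float, outer_count: float, margin: float = 0.0) -> float:
    """Hinge penalty when a nested region is predicted to hold more trees than its parent."""
    return float(max(0.0, inner_count - outer_count + margin))

# Hypothetical predictions for one unlabeled image at three scales,
# each spreading 10 trees uniformly over the tile.
maps = [np.full((s, s), 10.0 / (s * s)) for s in (64, 32, 16)]
loss_consistency = consistency_loss(maps)   # scales agree, so this is ~0
loss_rank_bad = ranking_loss(5.0, 3.0)      # inner > outer is penalized
loss_rank_ok = ranking_loss(2.0, 3.0)       # inner <= outer costs nothing
```

Neither loss needs ground-truth annotations: both only check that the network's own predictions are mutually plausible, which is what allows unlabeled images to contribute to training.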
Experiments
Datasets: Three datasets were used in the experiments.
- KCL-London dataset: This dataset contains high-resolution images (0.2 m ground sampling distance (GSD)) from London, divided into 308 unlabeled and 613 labeled images. The labeled set is further split into 452 training and 161 testing samples.
- Jiangsu dataset: This dataset consists of 24 Gaofen-II satellite images (0.8 m GSD) from Jiangsu Province, China. It contains 664,487 manually annotated trees across 2,400 images, divided into 1,920 training and 480 test samples.
- Yosemite dataset: This dataset covers Yosemite National Park, California, with a rectangular image of 19,200 × 38,400 pixels (0.12 m GSD). It contains 98,949 manually annotated trees and is split into 1,350 training and test samples.
Implementation details: The model uses an encoder-decoder architecture with a transformer-based encoder and three-scale density maps estimated by the decoder. Data augmentation techniques like horizontal flipping and random cropping are employed. The network is trained using the Adam optimizer with parameters fine-tuned on the KCL-London dataset.
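The augmentation step can be sketched as follows. The key point is that the image and its density map must receive the same crop and the same flip, or the supervision no longer lines up with the pixels. The function name, crop size, and flip probability are assumptions for illustration; the paper's actual settings are not given in this article.

```python
import numpy as np

def augment(image: np.ndarray, density: np.ndarray, crop: int, rng: np.random.Generator):
    """Apply one shared random crop and an optional horizontal flip to both inputs."""
    h, w = density.shape
    top = int(rng.integers(0, h - crop + 1))
    left = int(rng.integers(0, w - crop + 1))
    img = image[top:top + crop, left:left + crop]
    den = density[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        # Flip along the width axis; works for H x W and H x W x C arrays.
        img = img[:, ::-1]
        den = den[:, ::-1]
    return img, den

rng = np.random.default_rng(0)
density = rng.random((32, 32))                # stand-in density map
image = np.stack([density] * 3, axis=-1)      # stand-in 3-channel image
img_aug, den_aug = augment(image, density, crop=16, rng=rng)
```

After cropping, the target count for the sample is the sum of the cropped density map rather than the count of the full tile.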
Evaluation and comparisons: The evaluation protocol involved dividing the training sets into labeled and unlabeled subsets. Performance metrics such as mean absolute error (E_MAE), R-squared (R²), root mean squared error (E_RMS), grid average mean absolute error (GAME), precision (P), recall (R), and F1-measure (F1) were utilized. Comparisons with state-of-the-art models were conducted in both semi-supervised and supervised settings, with TreeFormer outperforming existing methods in both groups.
Overall, TreeFormer demonstrates superior performance compared to state-of-the-art models, showcasing the effectiveness of its architecture and learning strategy.
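For readers unfamiliar with the counting metrics listed above, here is a minimal sketch of how they are typically computed. It assumes R² means the coefficient of determination and that GAME at level L splits each image into 4^L non-overlapping cells; the paper's exact definitions may differ, and the function names are illustrative. Precision, recall, and F1 require matching predicted tree locations to annotations and are omitted.

```python
import numpy as np

def count_metrics(pred, true):
    """Return (MAE, RMSE, R^2) for per-image predicted and true tree counts."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    err = pred - true
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    ss_res = (err ** 2).sum()
    ss_tot = ((true - true.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot
    return float(mae), float(rmse), float(r2)

def game(pred_map, true_map, level: int) -> float:
    """GAME(L) for one image: sum of absolute count errors over 4**level cells."""
    grid = 2 ** level
    diff = np.asarray(pred_map, dtype=float) - np.asarray(true_map, dtype=float)
    h, w = diff.shape
    bh, bw = h // grid, w // grid
    cells = diff[:bh * grid, :bw * grid].reshape(grid, bh, grid, bw).sum(axis=(1, 3))
    return float(np.abs(cells).sum())

# Hypothetical counts for three test images.
mae, rmse, r2 = count_metrics([10, 20, 30], [12, 18, 33])

# One tree predicted in the wrong corner: the total is right, the location is not.
pred_map = np.zeros((4, 4)); pred_map[0, 0] = 1.0
true_map = np.zeros((4, 4)); true_map[3, 3] = 1.0
game0 = game(pred_map, true_map, 0)  # 0.0, global count matches
game1 = game(pred_map, true_map, 1)  # 2.0, two quadrants are each off by one
```

The last example shows why GAME complements MAE: a model can get the total count right while placing density in the wrong part of the image, and only the grid-based metric penalizes that.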
Conclusion
In conclusion, TreeFormer presents a significant advancement in tree counting from remote sensing images. The semi-supervised framework, built upon the transformer architecture, combines feature fusion and tree density estimation modules to improve feature extraction and the accuracy of the estimated density maps.
The proposed pyramid learning strategy enhances performance by applying local tree count ranking and local tree density consistency losses to unlabeled data. The results on multiple datasets demonstrate TreeFormer's superiority over existing models. Future work should focus on improving generalizability across diverse datasets by employing domain adaptation techniques and considering regional variations in tree shapes.