In a paper published in the journal Scientific Reports, researchers tackled the challenges of extracting building footprints from high-resolution aerial and satellite images in urban areas. They proposed automating this process by integrating red, green, and blue (RGB) orthophotos with digital surface models (DSM) to create a consistent four-band dataset, enhancing pixel-to-pixel data fusion.
Using the DeepLabv3 algorithm (deep convolutional networks for semantic image segmentation, version 3) for pixel-based segmentation, they achieved superior accuracy and detailed building boundary delineation over a 21 km² area in Turin, Italy. The method also significantly reduced training time compared to conventional approaches such as the U-shaped network (U-Net). The study demonstrated the potential of this integrated approach for applications in 3D modeling, change detection, and urban planning, supporting urban management tasks.
Background
Past work in building footprint segmentation includes rule-based methods, machine learning, and deep learning, often enhanced by data fusion techniques. Rule-based methods faced adaptability challenges, while machine learning improved detection but struggled with data alignment and computational demands. Deep learning with convolutional neural networks (CNNs) such as DeepLabv3 showed high accuracy, especially when incorporating multi-source data such as RGB orthophotos and DSMs.
Data fusion significantly improved segmentation accuracy by enhancing contrast and boundary delineation. Despite advancements, challenges like misalignment and the complexity of multi-source data integration persist, leading to innovative solutions like generative adversarial networks (GANs).
Building Footprint Segmentation
The study utilized two primary raster layers produced through aerial photogrammetry campaigns: an RGB orthomosaic with a 25 cm/pixel resolution providing spectral information and a DSM raster layer with a 50 cm/pixel resolution offering elevation data.
Pixel-level data fusion was employed to create a four-band integrated dataset, enhancing spectral and elevation information crucial for accurate building footprint segmentation. This process involved resampling the DSM to match the RGB orthomosaic's resolution, cropping both datasets to the same extent, stacking them along the band dimension, and normalizing pixel values.
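The fusion steps described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' code: the arrays are synthetic, the 2x nearest-neighbour upsampling matches the 50 cm to 25 cm resampling, and per-band min-max normalization is one common choice.

```python
import numpy as np

# Synthetic stand-ins for the real rasters (values and names are illustrative).
rgb = np.random.randint(0, 256, size=(512, 512, 3)).astype(np.float32)   # 25 cm/pixel
dsm = np.random.uniform(200.0, 260.0, size=(256, 256)).astype(np.float32)  # 50 cm/pixel

# 1) Resample the DSM to the RGB resolution (nearest-neighbour, 2x upsampling).
dsm_resampled = dsm.repeat(2, axis=0).repeat(2, axis=1)

# 2) Stack along the band dimension to form the four-band dataset.
fused = np.dstack([rgb, dsm_resampled])  # shape (512, 512, 4)

# 3) Normalize each band to [0, 1] (min-max, per band).
mins = fused.min(axis=(0, 1), keepdims=True)
maxs = fused.max(axis=(0, 1), keepdims=True)
fused_norm = (fused - mins) / (maxs - mins)

print(fused_norm.shape)
```

In a real workflow the cropping-to-common-extent step would use the rasters' georeferencing (e.g. via a GIS or a raster library) rather than array indices.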
The research focused on two leading deep learning algorithms for pixel-based semantic segmentation: U-Net and DeepLabv3. U-Net, known for its encoder-decoder architecture with skip connections, excels in capturing local and global features and recovering fine-grained details. However, its fixed kernel size may limit contextual information capture.
DeepLabv3, on the other hand, uses atrous convolution and atrous spatial pyramid pooling (ASPP) to handle large receptive fields and multi-scale contextual information efficiently. However, it may produce lower-resolution output maps. Both algorithms were evaluated for accuracy and boundary delineation on standalone and integrated datasets.
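The receptive-field advantage of atrous convolution can be made concrete with a small calculation. The formula below is standard for dilated convolutions; the ASPP rates shown are typical DeepLabv3 defaults and are an assumption, not values reported by the study.

```python
# Effective kernel size of an atrous (dilated) convolution:
# a k x k kernel with dilation rate d spans k + (k - 1) * (d - 1) pixels,
# widening the receptive field without adding parameters.
def effective_kernel(k: int, d: int) -> int:
    return k + (k - 1) * (d - 1)

# ASPP runs parallel 3x3 branches at several rates (values are illustrative).
for rate in (1, 6, 12, 18):
    print(f"rate {rate}: 3x3 kernel spans {effective_kernel(3, rate)} pixels")
```

This is why DeepLabv3 can aggregate multi-scale context cheaply: the rate-18 branch sees a 37-pixel span with the same nine weights as a plain 3x3 convolution.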
The team manually digitized 450 buildings for training and validation and converted them into binary masks of 256×256 pixels. The dataset was split into 80% training and 20% validation sets. Data preparation and training used the TensorFlow framework and ArcGIS. Key training parameters included the softmax activation function, cross-entropy loss function, Adam optimizer, a batch size of 8, and 20 training epochs. Both U-Net and DeepLabv3 used ResNet-50 as their backbone to enhance feature extraction and segmentation accuracy.
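The softmax activation and cross-entropy loss named above combine into the standard per-pixel objective for two-class (building vs. non-building) masks. The NumPy sketch below is a minimal reconstruction of that loss, not the study's TensorFlow code; shapes and sample values are illustrative.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean per-pixel softmax cross-entropy.

    logits: (H, W, 2) raw class scores; labels: (H, W) integer ids in {0, 1}.
    """
    # Subtract the per-pixel max for numerical stability, then take log-softmax.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # Pick the log-probability of the true class at every pixel.
    h, w = labels.shape
    picked = log_probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return -picked.mean()

# Illustrative usage on a random 4x4 tile.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 4, 2))
labels = rng.integers(0, 2, size=(4, 4))
print(softmax_cross_entropy(logits, labels))
```

Framework implementations (e.g. TensorFlow's sparse categorical cross-entropy) fuse the softmax and the log for stability in exactly this way.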
Enhancing Urban Segmentation
The results and analysis section investigates the impact of data fusion and elevation information on building footprint segmentation through several evaluation metrics. These metrics, derived from the confusion matrix, encompass precision, accuracy, recall, F1 score, and intersection over union (IoU), comprehensively evaluating how well the models distinguish building pixels from non-building pixels. Notably, DeepLabv3 trained on the integrated dataset emerges as the top performer, showing significant improvements in recall, F1 score, and IoU over the other configurations.
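All five metrics follow directly from the four confusion-matrix counts. The sketch below shows the standard definitions on binary masks; the example arrays are synthetic and not data from the study.

```python
import numpy as np

def segmentation_metrics(pred, truth):
    """Pixel-wise metrics from binary prediction and ground-truth masks."""
    tp = np.sum((pred == 1) & (truth == 1))  # building correctly detected
    tn = np.sum((pred == 0) & (truth == 0))  # background correctly detected
    fp = np.sum((pred == 1) & (truth == 0))  # false alarm
    fn = np.sum((pred == 0) & (truth == 1))  # missed building pixel
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "iou": tp / (tp + fp + fn),  # intersection over union for the building class
    }

# Illustrative usage on a tiny 2x2 mask pair.
pred = np.array([[1, 1], [0, 0]])
truth = np.array([[1, 0], [1, 0]])
print(segmentation_metrics(pred, truth))
```

Note that IoU penalizes both false positives and false negatives in a single number, which is why it is often the headline metric for footprint delineation.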
The analysis highlights the effectiveness of integrating RGB and DSM data to enhance segmentation accuracy, particularly in complex urban environments. Computational efficiency is also emphasized, with DeepLabv3 demonstrating faster training times due to its efficient use of atrous convolutions, underscoring its suitability for practical deployment. Furthermore, the performance evaluation delves into the nuanced interactions between model architecture and dataset complexity.
DeepLabv3's advanced features, such as atrous spatial pyramid pooling, prove crucial in leveraging the richer feature set provided by the integrated dataset. This capability allows DeepLabv3 to excel in capturing multi-scale contextual information essential for precise segmentation, as evidenced by its superior results across all metrics evaluated. Despite the computational overhead associated with the integrated dataset, the substantial gains in segmentation quality justify its use, emphasizing the pivotal role of data fusion and elevation information in enhancing urban mapping applications.
Quantitative metrics are combined with qualitative visualizations to demonstrate the impact of data fusion and model architecture on building footprint segmentation accuracy. Visual comparisons across varied urban scenarios show that U-Net, when given the integrated dataset's elevation data, improves in dense areas, while DeepLabv3 excels at handling complex geometries and terrain variations.
This comprehensive approach validates the effectiveness of integrating RGB and DSM data, providing practical insights for optimizing segmentation workflows in urban environments. Ultimately, the study underscores how these advancements enhance building footprint delineation, offering significant benefits for urban mapping and planning strategies.
Conclusion
To sum up, this study utilized integrated high-resolution datasets that combined RGB orthophotos with DSMs to automate the extraction of building footprints in urban areas. By employing the DeepLabv3 algorithm, the segmentation process effectively exploited the height information derived from DSMs, resulting in precise delineation of building boundaries.
Evaluation conducted in Turin, Italy, underscored the approach's advantages: superior accuracy and reduced training time compared to traditional methods like U-Net. These outcomes highlight the potential of this approach for enhancing applications such as 3D modeling, change detection, and urban planning.