In a recent publication in the journal npj Digital Medicine, researchers proposed Relay Learning, a secure deep-learning framework that accomplishes the physical isolation of clinical data from external intruders.
Background
The recent synergy between big data and artificial intelligence (AI) has ushered in numerous applications, notably in AI-aided medicine. Remarkable progress has been achieved in the development of robust medical AI systems for diverse clinical tasks such as intracranial hemorrhage detection, fundus disease identification, and lung cancer screening. These advancements rely on extensive datasets collected from multiple clinical sites.
However, the substantial potential of big data presents challenges, particularly in data privacy and security. Approaches such as Federated Learning and Swarm Learning aim to mitigate these concerns by sharing only model parameters, not actual data, among sites. Nevertheless, these methods require online connectivity during training, introducing security risks such as network firewall breaches and data exposure. Existing AI techniques, therefore, fall short of adequately addressing data privacy and security concerns when harnessing big data.
Relay learning architecture
Relay Learning is a secure multi-site deep learning framework that operates offline, with no network connection required between sites. In each Relay Learning instance, data from the different hosts is processed sequentially. The deep model is the sole entity authorized to access data within a host and to move between hosts, eliminating the need for simultaneous Internet connections.
The pipeline of Relay Learning, resembling a sequential fine-tuning strategy, involves training a model on one host, fine-tuning it on the next, and so on. However, Relay Learning introduces the DoubleGAN-based relay system, which facilitates knowledge retention across hosts. The DoubleGAN comprises two generative adversarial networks (GANs) that model images and labels separately, enabling the creation of heritage data while preserving data privacy. This heritage data encapsulates the data distribution from previous hosts and contributes to training the task model and DoubleGAN.
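The relay control flow described above can be illustrated with a deliberately tiny sketch. It is not the authors' implementation: the hypothetical `ToyGenerator` (a Gaussian summary) stands in for DoubleGAN, the "task model" is just a running estimate of the mean, and `heritage_per_host` is an assumed parameter. The point it shows is that only the model and the generator travel from host to host, while raw local data never leaves its host.

```python
import random
import statistics


class ToyGenerator:
    """Hypothetical stand-in for DoubleGAN: keeps a mean and std, never raw samples."""

    def __init__(self, mean, std):
        self.mean = mean
        self.std = std

    def sample(self, n, rng):
        # Synthetic "heritage data" drawn from the stored distribution summary.
        return [rng.gauss(self.mean, self.std) for _ in range(n)]


def fit_generator(values):
    # "Training" the toy generator means summarizing the data it was shown.
    return ToyGenerator(statistics.fmean(values), statistics.pstdev(values))


def relay_train(hosts, heritage_per_host=200, seed=0):
    """Visit hosts one at a time; carry over only the task model and generator."""
    rng = random.Random(seed)
    generator = None
    task_model = None  # toy task model: a scalar estimate of the overall mean
    hosts_seen = 0
    for local_data in hosts:
        heritage = []
        if generator is not None:
            # Replay synthetic data standing in for all previously visited hosts.
            heritage = generator.sample(heritage_per_host * hosts_seen, rng)
        combined = list(local_data) + heritage
        # Heritage data feeds both the task model and the next generator.
        task_model = statistics.fmean(combined)
        generator = fit_generator(combined)
        hosts_seen += 1
    return task_model, generator


# Two hosts with clearly different distributions (assumed toy data).
host_a = [float(i % 3 - 1) for i in range(200)]  # centered near 0
host_b = [10.0 + v for v in host_a]              # centered near 10
model, gen = relay_train([host_a, host_b])
```

Plain sequential fine-tuning on this toy data would end with an estimate near 10, reflecting only the last host. With heritage data replayed, the estimate stays close to the midpoint of both hosts.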
A two-stage strategy was devised to address a sharp transition issue during the training of DoubleGAN. The first stage fine-tunes the model on current data, producing a model capable of generating heritage data for that host. In the second stage, this fine-tuned model and the original DoubleGAN are combined, and both are used to train the model for the new host. This two-stage approach mitigates issues related to transitioning between hosts.
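One possible reading of the two-stage strategy is sketched below, again with a hypothetical Gaussian summary (`ToyGen`) in place of DoubleGAN. Stage one fits a generator on the current host's data alone. Stage two trains the generator that will leave the host on samples drawn from both the incoming generator and the stage-one generator. The function names and the equal sample count `n` are assumptions made for illustration, not details from the paper.

```python
import random
import statistics


class ToyGen:
    """Hypothetical generator: a Gaussian summary used in place of a GAN."""

    def __init__(self, mean, std):
        self.mean = mean
        self.std = std

    def sample(self, n, rng):
        return [rng.gauss(self.mean, self.std) for _ in range(n)]


def fit(values):
    return ToyGen(statistics.fmean(values), statistics.pstdev(values))


def two_stage_update(incoming_gen, local_data, n=500, seed=0):
    rng = random.Random(seed)
    # Stage 1: adapt to the current host only, yielding a generator for its data.
    local_gen = fit(local_data)
    # Stage 2: combine both generators; the outgoing generator learns from
    # synthetic samples of previous hosts and of the current host together.
    mixed = incoming_gen.sample(n, rng) + local_gen.sample(n, rng)
    return local_gen, fit(mixed)


incoming = ToyGen(0.0, 1.0)                        # knowledge from earlier hosts
current = [10.0 + (i % 5 - 2) for i in range(100)]  # current host, centered at 10
local_gen, outgoing = two_stage_update(incoming, current)
```

In this toy setting the outgoing generator covers both the earlier and the current distribution, rather than jumping abruptly to the current host's data.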
Multi-site datasets and evaluation methodology
Datasets: The evaluation of Relay Learning involved three multi-site datasets: the Retina Fundus dataset, the Mediastinum tumor dataset, and the Brain Midline dataset. The Brain Midline samples remained at their originating clinical sites, while the other two datasets were arranged to mimic multi-site learning.
The Retina Fundus dataset (F1–F5) was split into five distinct hosts. These hosts had varying imaging conditions, resulting in diverse data distributions. Data splitting was done carefully, preserving each host's unique data distribution. Preprocessing involved cropping images to the disc region, resizing, and normalization.
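The crop, resize, and normalize steps can be sketched in plain Python on a nested-list image. The article does not state which interpolation or normalization the authors used, so nearest-neighbor resizing and min-max scaling are assumptions here, as are the function names and the `box` format.

```python
def crop(image, top, left, height, width):
    """Cut a rectangular region (e.g. around the optic disc) out of the image."""
    return [row[left:left + width] for row in image[top:top + height]]


def resize_nearest(image, out_h, out_w):
    """Nearest-neighbor resize (assumed; the source does not name a method)."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def normalize_min_max(image):
    """Scale intensities to [0, 1] (assumed normalization scheme)."""
    flat = [v for row in image for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in image]
    return [[(v - lo) / (hi - lo) for v in row] for row in image]


def preprocess(image, box, size):
    top, left, height, width = box
    region = crop(image, top, left, height, width)
    return normalize_min_max(resize_nearest(region, size, size))


# Toy 4x4 "image" whose pixel value encodes its position.
toy_image = [[r * 4 + c for c in range(4)] for r in range(4)]
processed = preprocess(toy_image, box=(1, 1, 2, 2), size=4)
```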
The Mediastinum tumor dataset (T1-T8) was compiled from eight institutions, with five (T1-T5) used for internal training and testing and three (T6-T8) reserved exclusively for external testing. Board-certified radiologists annotated the segmentation masks. Images were cropped and resized. Internal data was split randomly into training and test sets, while external data was used only for testing.
The Brain Midline dataset (M1-M8) was collected from eight clinical institutions, each contributing brain computed tomography (CT) scans. Images were normalized, annotated in detail, and split into training and test sets.
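The internal/external arrangement described for the Mediastinum tumor dataset can be sketched as a per-host split. The 20% test fraction, the seed, and the function name `split_hosts` are assumptions for illustration; the article does not report the actual ratios.

```python
import random


def split_hosts(hosts, internal_ids, test_fraction=0.2, seed=0):
    """Random train/test split for internal hosts; external hosts are test-only."""
    rng = random.Random(seed)
    train, test = {}, {}
    for host_id, samples in hosts.items():
        if host_id in internal_ids:
            shuffled = list(samples)
            rng.shuffle(shuffled)
            n_test = max(1, round(len(shuffled) * test_fraction))
            test[host_id] = shuffled[:n_test]
            train[host_id] = shuffled[n_test:]
        else:
            # External hosts never contribute training data.
            test[host_id] = list(samples)
    return train, test


# Eight toy hosts with ten placeholder sample IDs each.
hosts = {f"T{i}": list(range(10)) for i in range(1, 9)}
internal = {"T1", "T2", "T3", "T4", "T5"}
train_sets, test_sets = split_hosts(hosts, internal)
```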
The evaluation metrics in the experiments included Hausdorff Distance (HD) for the Brain Midline dataset and Dice similarity coefficient (Dice) for the Mediastinum tumor and Retina Fundus datasets. The study adhered to ethical principles, obtaining approvals and waivers for anonymized data processing.
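For reference, both metrics have simple textbook definitions: Dice is twice the overlap of two masks divided by the sum of their sizes (higher is better), and the Hausdorff Distance is the largest distance from any point in one set to its nearest point in the other (lower is better). The sketch below implements those standard definitions in plain Python; it is not the authors' evaluation code, and returning 1.0 for two empty masks is an assumed convention.

```python
import math


def dice_coefficient(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks given as nested lists."""
    flat_pred = [v for row in pred for v in row]
    flat_target = [v for row in target for v in row]
    intersection = sum(1 for p, t in zip(flat_pred, flat_target) if p and t)
    total = sum(1 for p in flat_pred if p) + sum(1 for t in flat_target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


def hausdorff_distance(points_a, points_b):
    """Symmetric Hausdorff distance between two non-empty point sets."""

    def directed(src, dst):
        # Farthest any source point lies from its nearest destination point.
        return max(min(math.dist(p, q) for q in dst) for p in src)

    return max(directed(points_a, points_b), directed(points_b, points_a))


pred_mask = [[1, 1], [0, 0]]
true_mask = [[1, 0], [0, 0]]
dice = dice_coefficient(pred_mask, true_mask)  # 2 * 1 / (2 + 1)

line_a = [(0, 0), (0, 1)]
line_b = [(0, 0), (0, 3)]
hd = hausdorff_distance(line_a, line_b)  # point (0, 3) is 2 away from line_a
```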
Results: Relay Learning was evaluated in multi-site clinical settings spanning several diseases and anatomical structures. The evaluation comprised three distinct tasks, each with its own challenges.
The first task involved segmenting key retinal fundus structures using well-established public datasets from five sources. The second task used a 3D CT dataset assembled from eight institutions to emulate a real-world multi-site system for diagnosing mediastinum tumors. In the final task, Relay Learning trained a deep model for brain midline localization within five medical institutions, with external testing at three additional sites. Notably, the brain midline localization task required data sovereignty at individual sites, which rules out central learning and distinguishes Relay Learning from it.
In all three tasks, data was split for training and testing within and across sites, enabling a comprehensive assessment of Relay Learning's capabilities compared to other learning paradigms. Results showed Relay Learning outperformed local and sequential learning, with minimal impact from host order variations. Relay Learning also surpassed central learning, especially on external test sets. In the third task, Relay Learning exhibited remarkable robustness, achieving lower failure rates and outperforming local and sequential learning across internal and external tests, reinforcing its excellence in diverse multi-site clinical applications.
Conclusion
In summary, the researchers highlighted the importance of cross-site and international medical data sharing in modern healthcare. Relay Learning respects privacy provisions and ethics by transferring depersonalized knowledge offline, in the form of AI models, rather than the data itself. According to the authors, this can foster innovation in AI-aided medicine, promote healthcare resource sharing, and support collaboration in biomedical and healthcare research, even in the face of global challenges.