In an article recently published in the journal PLOS Genetics, researchers investigated the feasibility of using domain-adaptive deep learning (DL) methods to effectively address the simulation mis-specification problem in population genetics.
Background
In population genetics, the use of substantial amounts of simulated data for training DL models is crucial for their success. Evolution follows relatively simple rules under simplifying and largely realistic assumptions. These rules, coupled with the advancements in computing power, are exploited by the latest generation of computational simulators to efficiently generate large amounts of accurately labeled synthetic data across several evolutionary scenarios.
Additionally, programming libraries such as stdpopsim give researchers access to these simulators while improving the reproducibility of simulation workflows. This ability to generate synthetic training data is the foundation of the simulate-and-train approach to supervised machine learning (ML) for population genetic inference.
However, the approach relies primarily on well-specified simulation models. A trained DL model can propagate biases present in the simulated data and may perform poorly on real data under simulation mis-specification, that is, when the simulation assumptions differ from the generative process underlying the real data. Studies have demonstrated that model performance degrades significantly as mis-specification becomes severe.
The proposed approach
In this study, researchers proposed the use of domain adaptation techniques to address the simulation mis-specification problem in population genetics by training the ML model using both real and simulated data. Specifically, researchers reframed the simulation mis-specification problem as an unsupervised domain adaptation problem. They obtained substantial amounts of accurately labeled training data in the source domain using population-genetic simulations.
Subsequently, they applied the trained model to the unlabeled real data in the target domain. Domain adaptation techniques were used during training to address the mismatch between the source and target domains, which arises whenever a model learned from one data distribution is applied to a dataset drawn from another.
Researchers demonstrated the feasibility of this approach by incorporating a domain-adaptive neural network architecture into two DL models for population genetic inference: SIA, which infers positive selection from features of the ancestral recombination graph (ARG), and ReLERNN, which infers recombination rates from raw genotypic data.
Researchers developed domain-adaptive versions of the SIA and ReLERNN models, named dadaSIA and dadaReLERNN, each of which employs a gradient reversal layer (GRL). They investigated whether the GRL-based domain-adaptation technique could establish a domain-invariant representation of the data.
Each neural network contained two key components: the original network, which was applied to the labeled examples from the simulated (source) domain, and an alternative branch, which shared the feature-extraction portion of the original network and was trained to distinguish data from the simulated (source) domain and the real (target) domain.
During training by back-propagation, the GRL reversed the sign of the domain-classifier loss gradient before it reached the feature extractor. This systematically undermined the goal of distinguishing the two domains and thereby promoted domain invariance during feature extraction.
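The sign flip performed by the GRL can be illustrated with a minimal sketch in plain Python. This is not the dadaSIA or dadaReLERNN implementation: the one-weight feature extractor, the logistic domain classifier, and the names `GradientReversal`, `domain_branch_gradients`, and `lam` are all assumptions made purely for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class GradientReversal:
    """Identity in the forward pass; multiplies the gradient by -lam in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # hypothetical scaling factor for the reversed gradient

    def forward(self, features):
        return features

    def backward(self, grad_from_domain_classifier):
        return -self.lam * grad_from_domain_classifier


def domain_branch_gradients(x, d, w, v, grl):
    """Gradients of a binary cross-entropy domain loss for one example.

    x: input value; d: domain label (0 = simulated, 1 = real);
    w: toy feature-extractor weight; v: toy domain-classifier weight.
    """
    f = w * x                      # toy feature extractor
    h = grl.forward(f)             # GRL leaves the features unchanged
    p = sigmoid(v * h)             # predicted probability that the example is "real"
    dloss_dlogit = p - d           # derivative of cross-entropy w.r.t. the logit
    grad_v = dloss_dlogit * h      # the domain classifier receives the true gradient
    grad_h = dloss_dlogit * v      # gradient arriving at the GRL output
    grad_f = grl.backward(grad_h)  # sign is flipped on the way to the extractor
    grad_w = grad_f * x
    return grad_v, grad_w


grad_v, grad_w = domain_branch_gradients(
    x=2.0, d=1, w=0.5, v=1.0, grl=GradientReversal(lam=1.0)
)
```

Under gradient descent, the classifier weight `v` moves to reduce the domain loss, while the extractor weight `w` receives the negated gradient and therefore moves to increase it. That opposition is what pushes the extracted features toward being indistinguishable between domains.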
Experimental evaluation and findings
Researchers compared the performance of the domain-adaptive models with that of the standard models by designing two sets of benchmark experiments. In both, the methods were assessed on “real” data in the target domain that was itself generated by simulation but included features not considered by the simpler simulator used for the source domain.
In the first set, the background selection experiments, background selection was present only in the target domain. In the second set, the demography mis-specification experiments, the demographic model used for the source-domain simulations was estimated from “real” data that had been generated under a more complex demographic model.
Researchers also performed several experiments to investigate the performance of dadaSIA under increasingly severe simulation mis-specification. In addition, they applied dadaSIA to multiple loci in the human genome that had previously been analyzed with SIA, using whole-genome sequence data from the 1000 Genomes CEU population.
In both the demography mis-specification and background selection experiments, and on both regression and classification tasks, dadaSIA performed significantly better than the standard model. On the regression task, the domain-adaptive model reduced both the upward bias of the estimates and the absolute error relative to the standard model.
In all cases, dadaSIA nearly reached the upper bound set by the hypothetical true model and outperformed the standard model by a large margin, indicating that domain adaptation substantially mitigated the effects of simulation mis-specification on SIA.
The dadaReLERNN model showed results similar to those of dadaSIA: it corrected the downward bias in recombination-rate estimates and decreased the mean absolute error (MAE). In both the demography mis-specification and background selection experiments, dadaReLERNN significantly reduced the MAE compared to the standard ReLERNN model.
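The two quantities discussed here, bias and MAE, can be computed as in the short sketch below. The function name `bias_and_mae` and the toy numbers are hypothetical and in arbitrary units; they are not values from the study.

```python
def bias_and_mae(estimates, truths):
    """Return (mean signed error, mean absolute error) for paired values."""
    if len(estimates) != len(truths) or not estimates:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [est - true for est, true in zip(estimates, truths)]
    bias = sum(errors) / len(errors)                 # negative means downward bias
    mae = sum(abs(err) for err in errors) / len(errors)
    return bias, mae


# Toy illustration only: estimates that sit below the true values
# give a negative bias, i.e. a downward bias.
toy_bias, toy_mae = bias_and_mae([1.0, 1.5], [2.0, 2.0])
```

Bias and MAE capture different failure modes: errors that cancel in sign can give a bias near zero while the MAE stays large, which is why a reduction in both is the more informative result.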
The dadaSIA model performed well when mis-specification was caused by light to moderate bottlenecks or only by genealogy inference. Although its performance deteriorated as mis-specification became more severe, it still outperformed the standard model, even with a 5% bottleneck.
Moreover, adding domain adaptation did not significantly change the predictions of the standard SIA model on real data. In several cases, it improved SIA’s predictions, which demonstrated the feasibility of applying dadaSIA to real data.