Mitigating Simulation Mis-specification in Population Genetics: A Domain-Adaptive Deep Learning Approach

In an article recently published in the journal PLOS Genetics, researchers investigated the feasibility of using domain-adaptive deep learning (DL) methods to effectively address the simulation mis-specification problem in population genetics.

Study: Mitigating Simulation Mis-specification in Population Genetics: A Domain-Adaptive Deep Learning Approach. Image credit: Generated using DALL.E.3

Background

In population genetics, training DL models successfully depends on large amounts of simulated data. Under simplifying but largely realistic assumptions, evolution follows relatively simple rules. The latest generation of computational simulators exploits these rules, together with advances in computing power, to efficiently generate large amounts of accurately labeled synthetic data across many evolutionary scenarios.

Additionally, programming libraries such as stdpopsim have given researchers access to these simulators while improving the reproducibility of simulation workflows. This ability to generate synthetic training data is the foundation of the simulate-and-train approach to supervised machine learning (ML) for population genetic inference.

However, the approach relies primarily on well-specified simulation models. A trained DL model can propagate the biases present in the simulated data and may perform poorly on real data under simulation mis-specification, that is, when the simulation assumptions differ from the generative process underlying the real data. Studies have demonstrated that model performance degrades significantly as mis-specification becomes severe.

The proposed approach

In this study, researchers proposed the use of domain adaptation techniques to address the simulation mis-specification problem in population genetics by training the ML model using both real and simulated data. Specifically, researchers reframed the simulation mis-specification problem as an unsupervised domain adaptation problem. They obtained substantial amounts of accurately labeled training data in the source domain using population-genetic simulations.

Subsequently, they applied the trained model to the unlabeled real data in the target domain. Domain adaptation techniques were used during training to address the mismatch that arises when a model learned from one data distribution (the source domain) is applied to data drawn from another distribution (the target domain).
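The source/target setup described above can be illustrated with a minimal Python sketch. It is not taken from the study: the `Example` class, the `make_batch` helper, the `SOURCE`/`TARGET` constants, and the half-and-half batch split are all hypothetical choices made for illustration. The point is only that simulated examples carry a label, real examples do not, and every example carries a domain tag.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Sequence

# Hypothetical domain tags: simulated data is the source, real data the target.
SOURCE, TARGET = 0, 1


@dataclass
class Example:
    features: Sequence[float]
    domain: int             # SOURCE or TARGET
    label: Optional[float]  # known only for simulated (source) examples


def make_batch(simulated, real, batch_size, rng) -> List[Example]:
    """Draw half a batch from labeled simulations and half from unlabeled real data."""
    half = batch_size // 2
    batch = [Example(x, SOURCE, y) for x, y in rng.sample(simulated, half)]
    batch += [Example(x, TARGET, None) for x in rng.sample(real, half)]
    rng.shuffle(batch)
    return batch


# Toy data (random numbers, not genetic data): 3 features per example.
rng = random.Random(0)
simulated = [([rng.random() for _ in range(3)], rng.random()) for _ in range(20)]
real = [[rng.random() for _ in range(3)] for _ in range(20)]

batch = make_batch(simulated, real, batch_size=8, rng=rng)
```

In this sketch the label-prediction loss would be computed only on examples whose `label` is not `None`, while a domain classifier could use the `domain` tag of every example.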

Researchers demonstrated the feasibility of this approach by incorporating a domain-adaptive neural network architecture into two DL models for population genetic inference: SIA, which infers positive selection from features of the ancestral recombination graph (ARG), and ReLERNN, which infers recombination rates from raw genotypic data.

Researchers developed domain-adaptive versions of the SIA and ReLERNN models, named dadaSIA and dadaReLERNN, each of which employs a gradient reversal layer (GRL). They investigated the feasibility of using the GRL-based domain-adaptation technique to learn a domain-invariant representation of the data.

The neural networks contained two key components: the original networks, which were applied to labeled examples from the simulated (source) domain, and alternative branches, which shared the feature-extraction portions of the original networks and aimed to distinguish data from the simulated (source) domain from data from the real (target) domain.
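One way to write down the two branches is as a pair of losses that share a feature extractor. The notation below is an assumption introduced here for illustration, not taken from the study: $G_f$ is the shared feature extractor, $G_y$ the label predictor, $G_d$ the domain classifier, $\ell_y$ and $\ell_d$ are generic per-example losses, $d_i$ is the domain tag of example $i$, and $n_s$ and $n_t$ are the numbers of source and target examples.

```latex
\mathcal{L}_y(\theta_f, \theta_y)
  = \frac{1}{n_s} \sum_{i \in \text{source}}
    \ell_y\big( G_y(G_f(x_i; \theta_f); \theta_y),\; y_i \big)

\mathcal{L}_d(\theta_f, \theta_d)
  = \frac{1}{n_s + n_t} \sum_{i \in \text{source} \,\cup\, \text{target}}
    \ell_d\big( G_d(G_f(x_i; \theta_f); \theta_d),\; d_i \big)
```

Under this reading, the label loss $\mathcal{L}_y$ uses only simulated examples, because only they have labels, whereas the domain loss $\mathcal{L}_d$ uses both simulated and real examples, because every example has a known domain.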

During training by back-propagation, the GRL reversed the sign of the domain-classifier loss gradient before it reached the feature extractor. This systematically undermined the goal of distinguishing the two domains and thereby promoted domain-invariant feature extraction.
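The gradient reversal described above can be sketched with a deliberately tiny, hand-differentiated model in plain Python. This is an illustration under stated assumptions, not the dadaSIA or dadaReLERNN implementation: the scalar weights `w` (feature extractor), `a` (label branch), and `b` (domain branch), the squared-error and cross-entropy losses, and the scaling factor `lam` are all hypothetical.

```python
import math


def grl_forward(features):
    """Forward pass of a gradient reversal layer: the identity."""
    return features


def grl_backward(grad, lam):
    """Backward pass: flip the sign of the incoming gradient and scale by lam."""
    return -lam * grad


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradients(x, y, d, w, a, b, lam):
    """Gradients for (w, a, b) on one example.

    y is the label (None for unlabeled target data); d is the domain tag (0 or 1).
    """
    f = w * x  # shared feature

    # Label branch: squared error 0.5 * (a*f - y)^2, source examples only.
    if y is not None:
        err = a * f - y
        grad_a = err * f
        grad_f_label = err * a
    else:
        grad_a = 0.0
        grad_f_label = 0.0

    # Domain branch: binary cross-entropy on sigmoid(b * f).
    p = sigmoid(b * grl_forward(f))
    grad_b = (p - d) * f  # the domain classifier itself is trained normally
    grad_f_domain = grl_backward((p - d) * b, lam)  # reversed for the extractor

    grad_w = (grad_f_label + grad_f_domain) * x
    return grad_w, grad_a, grad_b


# One unlabeled target-domain example (made-up numbers).
grad_w, grad_a, grad_b = gradients(x=2.0, y=None, d=1, w=0.5, a=1.0, b=1.0, lam=1.0)
```

With `lam=1.0`, `grad_w` is exactly the negative of the gradient that would have improved the domain classifier, so a gradient-descent step on `w` makes the two domains harder to tell apart, while `b` is still updated to classify them as well as it can.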

Experimental evaluation and findings

Researchers compared the performance of the domain-adaptive models with that of the standard models in two sets of benchmark experiments. In both, the methods were assessed using “real” data in the target domain that was itself generated by simulation. However, this “real” data included features not considered by the simpler simulator employed for the source domain.

In the first set of experiments (the background selection experiments), background selection was present only in the target domain. In the second set (the demography mis-specification experiments), the demographic model used for the source-domain simulations was estimated from "real" data that had been produced under a more complex demographic model.

Additionally, researchers performed several experiments to investigate the performance of dadaSIA under increasingly severe simulation mis-specification. They also applied dadaSIA to multiple loci in the human genome that had previously been analyzed with SIA, using whole-genome sequence data from the 1000 Genomes CEU population.

Across both the background selection and demography mis-specification experiments, and on both regression and classification tasks, dadaSIA performed significantly better than the standard model. On the regression task, the domain-adaptive model reduced both the absolute error and the upward bias of the estimates relative to the standard model.

In all cases, dadaSIA came close to the upper bound set by the hypothetical true model and outperformed the standard model by a large margin, which indicated that domain adaptation substantially mitigated the effects of simulation mis-specification on SIA.

dadaReLERNN showed results similar to those of dadaSIA: it corrected the downward bias in recombination-rate estimates and decreased the mean absolute error (MAE). In both the demography mis-specification and background selection experiments, dadaReLERNN significantly reduced the MAE compared to the standard ReLERNN model.
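The two quantities mentioned above, bias and MAE, measure different things: bias is the average signed error and reveals systematic over- or under-estimation, while MAE is the average error magnitude. A minimal Python sketch follows; the helper names and all numbers are made up for illustration and are not results from the study.

```python
def mean_absolute_error(pred, true):
    """Average magnitude of the errors, regardless of direction."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def mean_signed_error(pred, true):
    """Average signed error (bias): negative means systematic under-estimation."""
    return sum(p - t for p, t in zip(pred, true)) / len(true)


# Invented toy values, in arbitrary units.
true_rates = [1.0, 2.0, 3.0, 4.0]
standard_pred = [0.5, 1.5, 2.0, 3.0]  # consistently too low
adapted_pred = [1.1, 1.9, 3.2, 3.8]   # errors scattered around zero

standard_mae = mean_absolute_error(standard_pred, true_rates)
standard_bias = mean_signed_error(standard_pred, true_rates)
adapted_mae = mean_absolute_error(adapted_pred, true_rates)
adapted_bias = mean_signed_error(adapted_pred, true_rates)
```

In this toy case the first set of predictions has a negative bias (a downward bias) and a larger MAE, while the second has a bias near zero and a smaller MAE, which is the kind of pattern the two metrics are meant to separate.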

dadaSIA performed well when mis-specification was caused by light to moderate bottlenecks or only by genealogy inference. Although its performance deteriorated as mis-specification became more severe, dadaSIA still outperformed the standard model, even with a 5% bottleneck.

Moreover, the addition of domain adaptation did not significantly change the predictions of the standard SIA model for real data. In several cases, the addition led to improvement in SIA’s prediction, which demonstrated the feasibility of applying dadaSIA to real data.

Written by

Samudrapom Dam

Samudrapom Dam is a freelance scientific and business writer based in Kolkata, India. He has been writing articles related to business and scientific topics for more than one and a half years. He has extensive experience in writing about advanced technologies, information technology, machinery, metals and metal products, clean technologies, finance and banking, automotive, household products, and the aerospace industry. He is passionate about the latest developments in advanced technologies, the ways these developments can be implemented in a real-world situation, and how these developments can positively impact common people.

Citations

Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Dam, Samudrapom. (2023, November 10). Mitigating Simulation Mis-specification in Population Genetics: A Domain-Adaptive Deep Learning Approach. AZoAi. Retrieved on September 18, 2024 from https://www.azoai.com/news/20231110/Mitigating-Simulation-Mis-specification-in-Population-Genetics-A-Domain-Adaptive-Deep-Learning-Approach.aspx.

  • MLA

    Dam, Samudrapom. "Mitigating Simulation Mis-specification in Population Genetics: A Domain-Adaptive Deep Learning Approach". AZoAi. 18 September 2024. <https://www.azoai.com/news/20231110/Mitigating-Simulation-Mis-specification-in-Population-Genetics-A-Domain-Adaptive-Deep-Learning-Approach.aspx>.

  • Chicago

    Dam, Samudrapom. "Mitigating Simulation Mis-specification in Population Genetics: A Domain-Adaptive Deep Learning Approach". AZoAi. https://www.azoai.com/news/20231110/Mitigating-Simulation-Mis-specification-in-Population-Genetics-A-Domain-Adaptive-Deep-Learning-Approach.aspx. (accessed September 18, 2024).

  • Harvard

    Dam, Samudrapom. 2023. Mitigating Simulation Mis-specification in Population Genetics: A Domain-Adaptive Deep Learning Approach. AZoAi, viewed 18 September 2024, https://www.azoai.com/news/20231110/Mitigating-Simulation-Mis-specification-in-Population-Genetics-A-Domain-Adaptive-Deep-Learning-Approach.aspx.
