Researchers have built EpiBERT, an AI model that deciphers how genes turn on and off, paving the way for advances in disease research and personalized medicine.
Research: A multi-modal transformer for cell type-agnostic regulatory predictions.
A team of investigators from Dana-Farber Cancer Institute, the Broad Institute of MIT and Harvard, Google, and Columbia University has created an artificial intelligence model that can predict which genes are expressed in any human cell. The research is published in the journal Cell Genomics.
The model, called EpiBERT, was inspired by BERT, a deep learning model designed to understand and generate human-like language. Like BERT, EpiBERT was trained using a "masked accessibility" learning method, where it predicts missing chromatin accessibility signals, similar to how BERT predicts missing words in a sentence.
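The masked-accessibility idea can be sketched in a few lines: hide a random fraction of the accessibility signal and score the model only on the hidden bins, just as BERT is scored on hidden words. The 15% mask rate, bin count, and zero-masking scheme below are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_accessibility(track, mask_frac=0.15):
    """Zero out a random fraction of accessibility bins.

    Returns the corrupted track and a boolean mask marking the hidden
    positions. The pre-training objective is to reconstruct the signal
    at exactly those masked bins (analogous to BERT's masked words).
    """
    mask = rng.random(track.shape) < mask_frac
    corrupted = np.where(mask, 0.0, track)
    return corrupted, mask

# Toy accessibility track: one non-negative value per genomic bin
track = rng.gamma(shape=2.0, scale=1.0, size=4096)
corrupted, mask = mask_accessibility(track)

# The training loss compares predictions to the true signal only at
# masked bins, e.g. mean squared error (zeros stand in for model output):
pred = np.zeros_like(track)
loss = np.mean((pred[mask] - track[mask]) ** 2)
```

In a real training loop, `pred` would come from the transformer, and the loss gradient would update its weights; only the masked positions contribute, so the model must infer accessibility from sequence and surrounding context.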
EpiBERT is a multi-modal transformer that integrates genomic sequence, chromatin accessibility, and motif enrichments to make its predictions. It was trained on data from hundreds of human cell types in two phases. In the first phase, it was pre-trained to learn the relationship between DNA sequence and chromatin accessibility across large 524-kilobase (kb) chunks of the genome.
In the second phase, it was fine-tuned to predict gene expression based on these learned relationships. It was fed the genomic sequence, which is 3 billion base pairs long, along with maps of chromatin accessibility that indicate which of these sequences are unwound from the chromosome and read by the cell.
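As a rough sketch of what a single input example might look like, the snippet below builds one 524-kb window: one-hot-encoded DNA plus a binned accessibility track. Taking 524 kb to mean 524,288 bp (a power of two) and using 128-bp bins are assumptions for illustration, not values confirmed by the article.

```python
import numpy as np

WINDOW_BP = 524_288            # ~524 kb input window (assumed to be 2**19 bp)
BIN_BP = 128                   # assumed bin resolution, for illustration only
N_BINS = WINDOW_BP // BIN_BP

rng = np.random.default_rng(0)

# One-hot encoded DNA for the window: (positions, 4 bases A/C/G/T)
sequence = np.zeros((WINDOW_BP, 4), dtype=np.float32)
sequence[np.arange(WINDOW_BP), rng.integers(0, 4, WINDOW_BP)] = 1.0

# Binned chromatin-accessibility signal over the same window, one value
# per bin — the cell-type-specific input that lets the model generalize
# to cell types it has never seen.
accessibility = rng.gamma(2.0, 1.0, size=N_BINS).astype(np.float32)

print(sequence.shape, accessibility.shape)
```

The key design point is that sequence is shared across all cell types, while the accessibility track changes per cell type, so one trained model can be queried for any cell type that has an accessibility map.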
During pre-training, the model learns how DNA sequence relates to chromatin accessibility in each training cell type; during fine-tuning, it leverages those learned relationships to predict which genes are active in the corresponding cell type. EpiBERT employs efficient linear-scaling attention layers, which allow it to process these large genomic windows while maintaining high accuracy.
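Linear-scaling attention avoids building the full length-by-length attention matrix, whose cost grows quadratically with window size. One common family of such methods replaces the softmax with a positive feature map and reorders the matrix products; the `elu(x)+1` feature map below is a simple illustrative choice, not necessarily the exact mechanism EpiBERT uses.

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Kernelized attention with O(L) cost in sequence length L.

    Instead of forming the L x L matrix softmax(Q K^T) and multiplying
    by V, apply a positive feature map phi to queries and keys and use
    associativity: phi(Q) @ (phi(K)^T V). The (d x d_v) summary matrix
    is independent of L, so memory and compute scale linearly.
    """
    phi = lambda X: np.where(X > 0, X + 1.0, np.exp(X))  # elu(x) + 1 > 0
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                       # (d, d_v): key/value summary
    Z = Qf @ Kf.sum(axis=0)             # (L,): per-query normalizer
    return (Qf @ KV) / (Z[:, None] + eps)

L, d = 4096, 64
rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((L, d)) * 0.1 for _ in range(3))
out = linear_attention(Q, K, V)
```

Because the expensive intermediate has shape (d, d_v) rather than (L, L), doubling the genomic window roughly doubles, rather than quadruples, the attention cost.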
It accurately identified regulatory elements – parts of the genome recognized by transcription factors – and their influence on gene expression across many cell types, building a generalizable, predictive "grammar" of regulation. This grammar enables the model to uncover the patterns that determine how genes are turned on and off in different cellular contexts.
In benchmarks, EpiBERT achieved comparable accuracy to Enformer, a state-of-the-art sequence-only model, in predicting gene expression and significantly outperformed it in generalizing to unobserved cell types. It also accurately predicted chromatin accessibility quantitative trait loci (caQTLs) and regulatory motifs, demonstrating its ability to capture the regulatory grammar of the genome.
This grammar-building process can be likened to how a large language model, such as ChatGPT, learns to build meaningful sentences and paragraphs from many examples of text. The EpiBERT model can process accessibility profiles and predict functional bases, as well as RNA expression, for a never-before-seen cell type.
Every cell in the body has the same genome sequence, so the difference between any two cell types is not the genes in the genome but which genes are turned on, when, and by how much. Approximately 20% of the genome codes for regulatory elements that play a crucial role in determining gene activation.
Still, very little is known about where those codes are in the genome, what their instructions look like, or how mutations affect function in a cell. EpiBERT will shed light on how genes are regulated in cells and, potentially, how that cell's regulatory system can be mutated in ways that lead to diseases such as cancer. However, the model's performance depends on the similarity of regulatory motifs between training and hold-out cell types, and it currently relies solely on chromatin accessibility data, which may not capture all regulatory features, such as repressive epigenetic states or 3D chromatin organization.
Researchers suggest that future improvements could enhance EpiBERT’s predictive power by integrating additional data types, such as histone modifications, transcription factor binding, and chromatin topology.
The study was supported by the Broad Institute, the Novo Nordisk Foundation, the National Human Genome Research Institute, the Sharf Green Cancer Research Fund, the Richard and Nancy Lubin Family, and the American Cancer Society. Training data for EpiBERT was sourced from large-scale genomic databases, including ENCODE, CATLAS, and GEO. Tensor Processing Unit (TPU) access and support were provided by Google.
Journal reference:
- Javed, N., Weingarten, T., Sehanobish, A., Roberts, A., Dubey, A., Choromanski, K., & Bernstein, B. E. (2025). A multi-modal transformer for cell type-agnostic regulatory predictions. Cell Genomics, 100762. DOI: 10.1016/j.xgen.2025.100762, https://www.sciencedirect.com/science/article/pii/S2666979X25000187