Evo 2 is a powerful new AI model that delivers deep, cross-species analysis of DNA, RNA, and proteins.

Capable of understanding the genetic code across all domains of life, Evo 2 is the largest publicly available AI model for genomic data in the world. It was built on the NVIDIA DGX Cloud platform through a collaboration between the Arc Institute, a non-profit biomedical research institute, and Stanford University.
Evo 2 was trained on a dataset of nearly 9 trillion nucleotides, the basic building blocks of DNA and RNA. It can be applied throughout biomolecular research, including predicting the shape and function of proteins from their genetic sequences, identifying new molecules for healthcare and industrial applications, and evaluating how genetic mutations affect function.
"Evo 2 represents a significant milestone in generative genomics. By advancing our understanding of these fundamental building blocks of life, we can pursue healthcare and environmental science solutions that are unimaginable today."
-- Patrick Hsu (Co-founding Core Investigator at the Arc Institute and Assistant Professor of Bioengineering at UC Berkeley)
"Designing new biology has traditionally been a tedious, unpredictable, and manual process. With Evo 2, we make the design of complex biological systems more accessible to researchers, enabling beneficial advances that would otherwise take far longer to achieve."
-- Brian Hie (Assistant Professor of Chemical Engineering at Stanford University, Dieter Schwarz Foundation Stanford Data Science Faculty Fellow, and Innovation Fellow at the Arc Institute)
Broad applications in biomolecular science
Evo 2 can provide deep insights into DNA, RNA, and proteins. Trained across multiple species, including plants, animals, and bacteria, this model can be applied in scientific fields such as healthcare, agricultural biotechnology, and materials science.

Evo 2 adopts a novel model architecture capable of processing genomic sequences up to one million tokens long. This broad view of the genome may uncover insights into the connections between distant parts of the genome and their implications for cellular function, gene expression, and disease mechanisms.
A single human gene can contain tens of thousands of nucleotides, so for an AI model to analyze such complex biological systems, it needs to process as much of the genetic sequence as possible at once.
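One reason context length matters so much for DNA models is that they typically tokenize at single-nucleotide resolution, so sequence length in bases translates one-to-one into tokens. The sketch below illustrates that idea; the exact vocabulary is an assumption for illustration, not Evo 2's actual tokenizer.

```python
# Sketch: single-nucleotide tokenization, as used by DNA language models.
# Each base maps to one token, so a 1 Mb genomic window is ~1M tokens.
# This vocabulary is an illustrative assumption, not Evo 2's real one.
NUCLEOTIDE_VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}  # N = unknown base

def tokenize(seq: str) -> list[int]:
    """Map a DNA string to integer token ids, one token per base."""
    return [NUCLEOTIDE_VOCAB[base] for base in seq.upper()]

print(tokenize("ACGTN"))   # [0, 1, 2, 3, 4]
# A gene of tens of thousands of bases therefore needs tens of thousands
# of tokens of context just to be seen in full:
print(len(tokenize("A" * 30_000)))  # 30000
```

At one token per base, a one-million-token context window corresponds directly to a one-megabase stretch of genome.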
Healthcare and drug discovery
Evo 2 can help researchers understand which genetic variations are associated with specific diseases and design new molecules that precisely target those regions. For example, researchers at Stanford University and the Arc Institute found that, in tests on BRCA1 (a gene linked to breast cancer), Evo 2 predicted with 90% accuracy whether previously uncharacterized mutations would affect gene function.
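Variant-effect prediction of this kind is often done zero-shot: score the reference sequence and the mutated sequence under the language model, and treat a strongly negative log-likelihood difference as evidence of disrupted function. The sketch below shows only the scoring arithmetic; the toy scorer (a made-up "GC-rich motif" preference) is an assumption standing in for a real model call.

```python
# Sketch of zero-shot variant-effect scoring with a sequence model:
#   effect = log-likelihood(variant) - log-likelihood(reference)
# Strongly negative values suggest loss of function. The toy scorer below
# is an illustrative stand-in (assumption), not Evo 2 itself.
import math

def toy_log_likelihood(seq: str) -> float:
    """Toy scorer favoring an illustrative 'GC-rich' motif.
    A real pipeline would query the language model here instead."""
    return sum(math.log(0.4) if b in "GC" else math.log(0.1) for b in seq)

def variant_effect(ref: str, pos: int, alt: str) -> float:
    variant = ref[:pos] + alt + ref[pos + 1:]
    return toy_log_likelihood(variant) - toy_log_likelihood(ref)

ref = "GCGCGC"
print(variant_effect(ref, 2, "T"))  # negative: G->T disrupts the toy motif
print(variant_effect(ref, 2, "G"))  # 0.0: synonymous with the reference
```

Because the score is a difference of log-likelihoods under the same model, it requires no labels or task-specific fine-tuning, which is what makes this style of evaluation "zero-shot".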
Agriculture
This model can help scientists develop crop varieties that are more climate-resilient or have higher nutritional density by providing deep insights into plant biology, thus addressing global food shortages. In other scientific fields, Evo 2 can also be applied to biofuel design or the engineering of proteins that degrade oil or plastic.
Overview of the Evo 2 Model Architecture, Training Process, Datasets, and Evaluation
Evo 2 models DNA sequences directly, enabling applications across the central dogma at molecular and cellular scales. It is trained on data spanning all domains of life, comprising trillions of nucleotides; in the accompanying UMAP visualization, each point represents a single genome. A two-stage training strategy optimizes performance while scaling the context window up to one million base pairs: data augmentation and weighting schemes emphasize functional genetic elements during pre-training and long-sequence composition during mid-training, capturing a wide range of biological patterns. Evo 2 is trained at 7B- and 40B-parameter scales, with training tokens divided between the short-context pre-training phase and the long-context mid-training phase. A schematic of the new multi-hybrid StripedHyena 2 architecture shows its efficient layout of short explicit (SE), medium regularized (MR), and long implicit (LI) hyena operators. Comparing iteration times at the 40B scale on 1,024 GPUs, StripedHyena 2 delivers significantly higher throughput than both StripedHyena 1 and a Transformer baseline. Validation perplexity during mid-training improves as model size and context length increase. A revised "needle in a haystack" task evaluates recall over long contexts (up to one million tokens), showing that the model achieves effective recall across a full one-million-token context.
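The validation perplexity mentioned above is a standard language-modeling metric: the exponential of the mean negative log-likelihood per token, so lower is better and a value of 4 corresponds to guessing uniformly among the four bases. A minimal computation, on illustrative numbers rather than Evo 2's actual probabilities:

```python
# Perplexity = exp(mean negative log-likelihood per token).
# Toy per-token probabilities below are illustrative, not Evo 2's.
import math

def perplexity(token_probs: list[float]) -> float:
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0: uniform over A/C/G/T
print(perplexity([0.9, 0.8, 0.95]))          # close to 1: a confident model
```

For a 4-letter nucleotide alphabet, perplexity is bounded between 1 (perfect prediction) and 4 (no information), which is why even small improvements in DNA-model perplexity are meaningful.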
Mechanistic interpretability of Evo 2 reveals features at the DNA, RNA, protein, and organism levels.
Sparse autoencoders (SAEs) were trained on Evo 2 to extract features corresponding to interpretable biological functions, which can be used for annotation, discovery, and guiding sequence generation. In the E. coli K-12 MG1655 genome, phage-related features preferentially activate on RefSeq-annotated prophages (left and upper right) and on phage-derived spacers within CRISPR arrays (lower right). Feature activations corresponding to open reading frames (ORFs), intergenic sites, tRNAs, and rRNAs are shown across a 100-kb region of the same genome. One region contains the tufB gene and a tRNA array ending with thrT (left), and another the rpoB-rpoC locus (right), showing characteristic activations for α-helices, β-sheets, and tRNAs; structural predictions from AlphaFold 3 (AF3) are overlaid: the complex of EF-Tu with thrT tRNA on the left, and the complex of RpoB with RpoC on the right. In the human genome, one feature activates more strongly after a frameshift mutation than after less harmful mutation types, and a set of features activates at DNA motifs corresponding to transcription factor binding sites. Features associated with exons, introns, and their boundaries in the human genome can even be used to annotate the woolly mammoth genome.
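An SAE of this kind is conceptually simple: it maps model activations through an overcomplete ReLU encoder with an L1 sparsity penalty, then reconstructs them with a linear decoder, so that individual hidden units tend to align with distinct, interpretable features. The NumPy sketch below shows that training loop on random stand-in data; the dimensions, data, and hyperparameters are all illustrative assumptions, not those used for Evo 2.

```python
# Minimal sparse autoencoder (SAE) sketch in NumPy: ReLU encoder with an
# L1 sparsity penalty, linear decoder, trained by gradient descent.
# Data and hyperparameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, n = 16, 64, 256        # model dim, overcomplete dict, samples
X = rng.normal(size=(n, d_model))         # stand-in for model activations

W_enc = rng.normal(scale=0.1, size=(d_model, d_hidden))
W_dec = rng.normal(scale=0.1, size=(d_hidden, d_model))
b_enc = np.zeros(d_hidden)
l1, lr = 1e-3, 0.05                       # sparsity weight, learning rate

def forward(X):
    f = np.maximum(X @ W_enc + b_enc, 0.0)  # sparse feature activations
    return f, f @ W_dec                     # reconstruction

f0, X_hat0 = forward(X)
mse_before = float(((X_hat0 - X) ** 2).mean())

for _ in range(300):
    f, X_hat = forward(X)
    err = X_hat - X
    # gradient of reconstruction error + L1 penalty, masked by ReLU
    g_f = (err @ W_dec.T + l1 * np.sign(f)) * (f > 0)
    W_dec -= lr * f.T @ err / n
    W_enc -= lr * X.T @ g_f / n
    b_enc -= lr * g_f.mean(axis=0)

f1, X_hat1 = forward(X)
mse_after = float(((X_hat1 - X) ** 2).mean())
print(mse_before, mse_after)  # reconstruction error should drop
```

In the real setting, `X` would be activations collected from an intermediate Evo 2 layer over genomic sequences, and each hidden unit's activation pattern along the genome is what gets compared against annotations like prophages, ORFs, and exon boundaries.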
Summary
All forms of life encode information through DNA. Although tools for genome sequencing, synthesis, and editing have revolutionized biological research, the intelligent construction of new biological systems also requires a deep understanding of the immense complexity embedded in genomes.
Evo 2 is a biological foundation model trained on 9.3 trillion curated DNA base pairs spanning all domains of life. Available at 7B and 40B parameters, it provides an unprecedented one-million-token context window at single-nucleotide resolution, and can accurately predict the functional impact of genetic variants, from non-coding pathogenic mutations to clinically relevant BRCA1 variants, from DNA sequence alone, without task-specific fine-tuning.
Evo 2 autonomously learns diverse biological features, including exon-intron boundaries, transcription factor binding sites, protein structural elements, and prophage genomic regions. Beyond prediction, Evo 2 can generate whole-genome sequences for mitochondria, prokaryotes, and eukaryotes with greater naturalness and coherence than previous methods. With search-guided inference, Evo 2 achieves controllable generation of epigenomic structure, demonstrating inference-time scaling in biology for the first time.
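The idea behind search-guided inference is to spend extra compute at generation time: expand several candidate continuations at each step, score them with an external objective, and keep only the best. The toy beam search below illustrates the pattern; the GC-content objective, beam width, and alphabet-level expansion are illustrative assumptions, not Evo 2's actual guidance procedure (which steers toward targets such as epigenomic profiles).

```python
# Sketch of search-guided sequence generation: expand candidates, score
# them with an external objective, keep the top k (beam search).
# The GC-content objective and beam width are illustrative assumptions.
def guided_generate(length: int, beam: int = 4, target_gc: float = 0.6) -> str:
    def score(seq: str) -> float:
        gc = sum(b in "GC" for b in seq) / len(seq)
        return -abs(gc - target_gc)          # closer to target GC is better

    candidates = [""]
    for _ in range(length):
        expanded = [s + b for s in candidates for b in "ACGT"]
        expanded.sort(key=score, reverse=True)
        candidates = expanded[:beam]         # keep top-k by the objective
    return candidates[0]

seq = guided_generate(10)
gc = sum(b in "GC" for b in seq) / len(seq)
print(seq, gc)  # a 10-base sequence steered to 60% GC content
```

In practice the expansion step would sample continuations from the language model rather than enumerate bases, and the scorer would be a predictive model of the desired property; increasing the beam width or number of samples is exactly the "inference-time scaling" knob.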
Evo 2 is fully open, including model parameters, training code, inference code, and the OpenGenome2 dataset, with the aim of accelerating the exploration and design of biological complexity. GitHub link🔗: http://github.com/ArcInstitute/evo2
