Modern machine learning algorithms combined with massive datasets of genomic sequences could enable a general biological foundation model that learns the intrinsic logic of whole genomes.
However, current efforts to model molecular biology with machine learning have focused on modality-specific models. By contrast, complex biological processes, such as gene regulation, CRISPR immunity, or genetic transposition, rely on many interactions among molecules across multiple modalities.
Here, we present Evo, a 7-billion-parameter genomic foundation model that is trained to generate DNA sequences at whole-genome scale and designed to capture two fundamental aspects of biology: the multimodality of the central dogma and the multiscale nature of evolution.
Evo excels at multimodal generation tasks, which we demonstrated by generating synthetic CRISPR-Cas molecular complexes and transposable systems. We experimentally validated the functional activity of Evo-generated CRISPR-Cas molecular complexes as well as IS200 and IS605 transposable systems, which represent the first examples of protein-RNA and protein-DNA codesign with a language model.
We make code and models for Evo publicly available at https://github.com/evo-design/evo.