Sequence modeling and design from molecular to genome scale with Evo

Modern machine learning algorithms combined with massive datasets of genomic sequences could enable a general biological foundation model that learns the intrinsic logic of whole genomes.

However, current efforts to model molecular biology with machine learning have focused on creating modality-specific models. By contrast, complex biological processes, such as gene regulation, CRISPR immunity, or genetic transposition, rely on many interactions involving molecules across multiple modalities.

Here, we present Evo, a 7-billion-parameter genomic foundation model that is trained to generate DNA sequences at whole-genome scale and designed to capture two fundamental aspects of biology: the multimodality of the central dogma and the multiscale nature of evolution.

Evo excels at multimodal generation tasks, which we demonstrated by generating synthetic CRISPR-Cas molecular complexes and transposable systems. We experimentally validated the functional activity of Evo-generated CRISPR-Cas molecular complexes as well as IS200 and IS605 transposable systems, which represent the first examples of protein-RNA and protein-DNA codesign with a language model.

Code and models for Evo are publicly available at https://github.com/evo-design/evo.
