Scientists have developed an AI system capable of simulating hundreds of millions of years of protein evolution, creating a novel fluorescent protein unlike any found in nature.
The research team, led by Alexander Rives at EvolutionaryScale, created a large language model (LLM) called ESM3 to process and generate information about protein sequences, structures, and functions.
By training on data from billions of natural proteins, ESM3 learned to predict how proteins might evolve and change over time.
“ESM3 is an emergent simulator that has been learned from solving a token prediction task on data generated by evolution,” the researchers explain in the study.
“It has been theorized that neural networks discover the underlying structure of the data they are trained to predict. In this way, solving the token prediction task would require the model to learn the deep structure that determines which steps evolution can take, i.e. the fundamental biology of proteins.”
To test the model, the team prompted ESM3 to design an entirely new green fluorescent protein (GFP), a type of protein that gives certain marine animals, such as jellyfish, their green glow and that is widely used as a marker in biotechnology research.
The AI-generated protein, dubbed esmGFP, is only 58% identical in amino acid sequence to the most similar known fluorescent protein.
Remarkably, esmGFP exhibits brightness comparable to naturally occurring GFPs and maintains the characteristic barrel-shaped structure essential for fluorescence.
The researchers estimate that producing a protein this distant from known GFPs would have taken over 500 million years of natural evolution.
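For readers unfamiliar with the 58% figure, sequence identity is the percentage of aligned positions at which two proteins carry the same amino acid. The snippet below is a minimal sketch of that calculation, not the study's method: it assumes the two sequences have already been aligned (with "-" marking gaps), it counts every column that has at least one residue in the denominator (other conventions exist), and the function name `percent_identity` and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity between two pre-aligned sequences ('-' marks a gap).

    Assumption: columns where both sequences have a gap are ignored, and
    every other column counts toward the denominator.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" and b == "-":
            continue  # column carries no information
        compared += 1
        if a == b:  # both are residues here, since double gaps were skipped
            matches += 1

    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


# Toy example (made-up fragments, not real GFP sequences):
# 8 of the 10 columns match, one is a substitution and one is a gap.
print(percent_identity("MSKGEELFTG", "MSKGAELF-G"))
```

A value of 58% on this scale means that roughly four in ten aligned positions differ between esmGFP and its closest known relative.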
More about the study
The process of generating esmGFP involved several key steps:
- Data: Researchers trained ESM3 on approximately 2.78 billion natural proteins collected from sequence and structure databases. This included data from UniRef, MGnify, JGI, and other sources.
- Architecture: ESM3 uses a transformer-based architecture with some modifications, including a “geometric attention” mechanism to process 3D protein structures.
- Prompting: The researchers prompted ESM3 with minimal structural information taken from a template GFP.
- Generation: ESM3 used this prompt to generate novel protein sequences and structures through an iterative process.
- Filtering: Thousands of candidate designs were computationally evaluated and filtered to find the strongest candidates.
- Experimental testing: The most promising designs were synthesized and tested in the lab for fluorescence activity.
- Refinement: After identifying a dim but distant GFP variant, the researchers used ESM3 to further optimize the design, ultimately producing a brighter fluorescent protein.
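The generate, filter, and refine steps above follow a common pattern in computational design, which can be sketched in a few lines. This is only an illustration under loud assumptions: random point mutations stand in for ESM3's generation, a toy similarity score stands in for the computational filters and lab measurements, and every name here (`propose_variants`, `toy_score`, `design_loop`) is hypothetical rather than part of any ESM3 interface.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def propose_variants(parent, n, rng, mutations=2):
    """Stand-in for a generative model: n random point-mutants of parent."""
    variants = []
    for _ in range(n):
        residues = list(parent)
        for pos in rng.sample(range(len(residues)), mutations):
            residues[pos] = rng.choice(AMINO_ACIDS)
        variants.append("".join(residues))
    return variants


def toy_score(seq, target):
    """Stand-in for real filters: fraction of positions matching a target."""
    return sum(a == b for a, b in zip(seq, target)) / len(target)


def design_loop(seed_seq, target, rounds=3, per_round=200, keep=5, seed=0):
    """Repeatedly generate candidates, rank them, and keep the best few."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    pool = [seed_seq]
    for _ in range(rounds):
        candidates = list(pool)  # parents stay in, so the best never gets worse
        for parent in pool:
            candidates.extend(propose_variants(parent, per_round, rng))
        # Rank by score (highest first); ties are broken alphabetically.
        ranked = sorted(set(candidates), key=lambda s: (-toy_score(s, target), s))
        pool = ranked[:keep]  # "filtering": only top designs seed the next round
    return pool
```

Calling `design_loop("AAAAAAAAAAAA", "ACDEFGHIKLMN")` returns the five highest-scoring toy sequences after three rounds. In the actual study, the scoring role was played by computational evaluation followed by fluorescence measurements in the lab, which is far more expensive than this toy score.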
The implications of this research extend beyond the creation of a single novel protein.
ESM3 demonstrates an ability to explore protein design spaces far removed from what natural evolution has produced, opening up new avenues for creating proteins with desired functions or properties.
Dr. Tiffany Taylor, Professor of Microbial Ecology and Evolution at the University of Bath, who was not involved in the study, told Live Science: “Right now, we still lack the fundamental understanding of how proteins, especially those ‘new to science,’ behave when introduced into a living system, but this is a cool new step that allows us to approach synthetic biology in a new way.”
“AI modeling like ESM3 will enable the discovery of new proteins that the constraints of natural selection would never allow, creating innovations in protein engineering that evolution can’t,” Dr. Taylor added.
Generative protein design
The researchers argue that ESM3 is not simply retrieving or recombining existing protein information.
Instead, it appears to have developed an understanding of the fundamental principles governing protein structure and function, allowing it to generate truly novel designs.
AI-driven protein research and design is advancing rapidly, with DeepMind’s AlphaFold 3 predicting the structures of proteins and their interactions with other molecules with high accuracy.
AI-designed proteins have also shown strong binding to their targets, indicating that they have practical uses.
However, as with any fast-moving technology that intervenes in biology, there are risks.
First, if AI-designed proteins were to escape into the environment, they could potentially interact with natural ecosystems, even outcompeting natural proteins or disrupting existing biological processes.
Second, they could trigger unexpected interactions within living organisms, potentially even creating harmful biological agents or toxins.
Researchers recently called for ethical guardrails for AI-protein design to prevent risky outcomes in this exciting, if unpredictable, field.