https://www.nature.com/articles/d41586-022-03539-1?utm_source=substack&utm_medium=email


Microbial molecules from soil, seawater and human bodies are among the planet’s least understood.

The ESM Metagenomic Atlas contains structural predictions for 617 million proteins. Credit: ESM Metagenomic Atlas (CC BY 4.0)

When London-based artificial-intelligence (AI) company DeepMind unveiled predicted structures for some 220 million proteins this year, the trove covered nearly every protein from known organisms in DNA databases. Now, another tech giant is filling in the ‘dark matter’ of the protein universe.

Researchers at Meta (formerly Facebook, headquartered in Menlo Park, California) have used AI to predict the structures of some 600 million proteins from bacteria, viruses and other microorganisms that haven’t been characterized.

“These are the structures we know the least about. These are incredibly mysterious proteins. I think they offer the potential for great insight into biology,” says Alexander Rives, the research lead of Meta AI’s protein team.

The scientists generated the predictions, described in a 1 November preprint [1], using a ‘large language model’, a type of AI that can predict text from just a few letters or words.

Normally, language models are trained on large volumes of text. To apply them to proteins, Rives and his colleagues instead fed the AI sequences of known proteins, which can be written down as a series of letters, each representing one of 20 possible amino acids. The network then learnt to fill in the sequences of proteins in which some of the amino acids were obscured.
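To make the fill-in-the-blank idea concrete, here is a minimal sketch of how a training example might be built from a protein sequence. It only prepares the masked input and the hidden answers; it does not train or call any model. The function name `mask_sequence`, the `?` placeholder, the 15% masking rate and the example sequence are all illustrative assumptions, not details reported for Meta's model.

```python
import random

# The 20 standard amino acids, written as one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "?"  # hypothetical placeholder for an obscured residue


def mask_sequence(sequence, mask_fraction=0.15, seed=0):
    """Hide a fraction of residues and return (masked_sequence, targets).

    targets maps each hidden position to the amino acid that was there,
    which is what a model would be asked to recover during training.
    The 15% default is an assumption made for illustration.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    if any(residue not in AMINO_ACIDS for residue in sequence):
        raise ValueError("sequence contains a non-standard amino-acid code")

    rng = random.Random(seed)  # seeded so the example is reproducible
    n_masked = max(1, round(len(sequence) * mask_fraction))
    positions = sorted(rng.sample(range(len(sequence)), n_masked))

    masked = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = sequence[pos]
        masked[pos] = MASK
    return "".join(masked), targets


if __name__ == "__main__":
    # An arbitrary made-up sequence, used only to show the input format.
    example = "MKTAYIAKQRQISFVKSHFSRQ"
    masked, targets = mask_sequence(example)
    print(masked)
    print(targets)
```

During training, a network would see only the masked string and be scored on how well it guesses the hidden letters; repeating this over many known sequences is what the article describes as learning to fill in obscured amino acids.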

Protein ‘autocomplete’

This training imbued the network with an intuitive understanding of protein sequences, which contain information about their shapes, says Rives. A second step — inspired by DeepMind’s pioneering protein-structure-predicting AI, AlphaFold — combines such insights with information about the relationships between known protein structures and sequences, to generate predictions.

Meta’s network, called ESMFold, isn’t quite as accurate as AlphaFold, Rives’ team reported earlier this year [2], but it is about 60 times faster at predicting structures for short sequences, he says. “What this means is that we can scale structure prediction to much larger databases.”

As a test, the researchers unleashed their model on a database of bulk-sequenced ‘metagenomic’ DNA from environmental sources such as soil, seawater and the human gut and skin. The vast majority of the entries — which encode potential proteins — come from single-cell organisms that have never been isolated or cultured and are unknown to science.

In total, the team predicted the structures of more than 617 million proteins. The effort took just two weeks (by contrast, AlphaFold can take minutes to generate a single prediction). The structures are freely available for use, as is the code underlying the model, says Rives.
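A rough back-of-the-envelope calculation shows what that pace implies. The sketch below assumes exactly 617 million predictions and exactly 14 days of continuous running, so the result is an aggregate rate across whatever hardware was used, not the speed of a single prediction. The function name `predictions_per_second` is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def predictions_per_second(total_predictions, days):
    """Average aggregate throughput over the whole run."""
    return total_predictions / (days * SECONDS_PER_DAY)


if __name__ == "__main__":
    # Figures from the article, rounded: 617 million structures, two weeks.
    rate = predictions_per_second(617_000_000, 14)
    print(f"about {rate:.0f} structures per second on average")
```

Under these assumptions the run averages roughly 500 structures per second overall, which is why a method that needs minutes per structure would be hard to apply to a database of this size.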


Of the 617 million predictions, the model deemed more than one-third to be high quality, such that researchers can have confidence that the overall protein shape is correct and, in some cases, can discern atomic-level details. Millions of these structures are entirely unlike anything in the databases of protein structures determined experimentally, or any of AlphaFold’s predictions from known organisms.