Video: a 3D approach gives us a much better window into DNA and genetic builds
If you’re involved in genomics, you might want to pay attention to this video where we get a very useful metaphor for sequencing and other kinds of genetic research work.
Starting out, you get this comparison from CSAIL Research Scientist Rohit Singh about how it’s hard to make a book into a movie – and by the same token, how it might be easier to access a book than a movie through certain interfaces.
Then he reveals what the comparison stands for: genomic sequence is linear, like a book, while the 3D structure of proteins is a richer kind of data, like a movie, and needs a different kind of “reading” and modeling.
We can’t really see the three-dimensional model very well.
“Our cells live in 3d,” Singh says. “And the actors in the movie of the cells are proteins. They hold up the cells, they catalyze reactions, they bring in signals from outside. And our understanding of how proteins operate and look in 3d is very poor. And we can’t get to it just from sequence, not easily.”
However, he suggests, we can read the book: the sequence, as Singh notes, is linear. He also points out that sequencing costs are decreasing for the types of data sets involved, but with a caveat: getting from the sequence you have to that richer structural picture is, again, difficult.
“That is the grand challenge,” he says. “How do you get to a protein structure and function from its sequence? And just getting structure is not enough, a single structure of the protein doesn’t tell you everything.”
It is, he says, a long-standing problem. Singh poses an approach:
“One way we … formalize this is saying, ‘I give you a sequence,’” he says. “How can I edit the sequence to preserve its structure and function? So for example, what could I change that amino acid to, while preserving its structure and function?”
Answers, he says, can come from the study of evolution and from looking at corresponding processes across multiple species.
Evolution gives us “distributional semantics” as he calls it: a language model that can enhance the road maps scientists are using.
Where Singh references a ‘masked’ language model, he shows how the same type of approach can inform genomic research, although there are differences: the next-word strategy that NLP LLMs use, he says, might not be best for genomics. (Take a look.)
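To make the masking idea concrete, here is a minimal sketch in Python. It is not Singh's model: the aligned sequences are made up, and the "prediction" is just a count of which residues other species carry at the hidden position. A trained masked language model would instead learn to fill the gap from the surrounding residues. The names `ALIGNED_HOMOLOGS`, `mask_position`, and `predict_masked` are hypothetical.

```python
from collections import Counter

# Hypothetical toy alignment: the "same" protein segment from several species.
# These sequences are invented for illustration; they are not real data.
ALIGNED_HOMOLOGS = [
    "MKTAYIAK",
    "MKTAYLAK",
    "MKSAYIAK",
    "MRTAYIAK",
    "MKTAYVAK",
]

MASK = "?"


def mask_position(sequence, position):
    """Return the sequence with the residue at `position` hidden behind MASK."""
    return sequence[:position] + MASK + sequence[position + 1:]


def predict_masked(masked_sequence, homologs):
    """Estimate a distribution over residues for the masked position.

    Uses per-column frequencies across the aligned homologs, a crude
    stand-in for what a trained masked language model would predict.
    """
    position = masked_sequence.index(MASK)
    counts = Counter(h[position] for h in homologs)
    total = sum(counts.values())
    return {residue: n / total for residue, n in counts.items()}


masked = mask_position("MKTAYIAK", 5)
distribution = predict_masked(masked, ALIGNED_HOMOLOGS)
print(masked, distribution)
```

Residues that show up at the hidden position in other species are, in this toy reading, candidate edits that evolution has already tolerated, which is one way to think about the "what could I change that amino acid to" question.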
Going into some applications of transfer learning, we see the example of answering the question: does a given drug bind to a given protein?
By putting drugs and proteins in the same system, Singh observes, you can process 100 million interactions per day.
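One plausible reading of "the same system" is a shared embedding space, where each drug and each protein becomes a vector and a candidate pair is scored by comparing the two vectors. The sketch below assumes that reading. The embeddings are random stand-ins rather than learned ones, and `fake_embedding`, `rank_drugs`, and the molecule names are hypothetical. It does not reproduce the throughput figure; it only shows why scoring a pair can be cheap once the vectors exist.

```python
import math
import random

EMBEDDING_DIM = 8  # arbitrary size chosen for this illustration


def fake_embedding(name, dim=EMBEDDING_DIM):
    """Deterministic random vector standing in for a learned embedding."""
    rng = random.Random(name)  # seeding with the name keeps results repeatable
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, in the range [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_drugs(protein_name, drug_names):
    """Score every drug against one protein and sort best-first."""
    protein_vec = fake_embedding(protein_name)
    scored = [
        (cosine_similarity(protein_vec, fake_embedding(drug)), drug)
        for drug in drug_names
    ]
    return sorted(scored, reverse=True)


ranking = rank_drugs("protein_X", ["drug_A", "drug_B", "drug_C"])
for score, drug in ranking:
    print(f"{drug}: {score:+.3f}")
```

Because each molecule is embedded once and each pair costs only a dot product and two norms, screening very large numbers of pairs becomes a matter of simple arithmetic rather than a separate simulation per pair.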
We can see the effect this will have on drug discovery and research!
Now we just have to apply these new solutions to what we’re already doing with genetics.
“We are relatively limited (in) protein data,” Singh tells us. “And that has been a challenge in making good models of what a protein can do. But what we can do now is, we can train these foundation models on these large corpuses of sequences – on that we can do … transfer learning and get really high-quality and accurate predictions.” (He also references few-shot learning, another holdover from NLP.)
That’s the idea, in a nutshell. But go back and watch the whole thing, and you’ll see some of the strategy detail and context that’s driving this innovative work. Singh leaves us with this as a takeaway:
“Applying AI to learn how to speak the language of proteins can help us make significant advances in drug discovery,” he says.
It certainly should be a game-changer for drug discovery, but this idea could also lead to all kinds of other big advances, when you think about genetic modeling and its centrality to diverse modern research. Let’s keep an eye on this as the method and the concept evolve.