A New AI Model Learns DNA’s Hidden Language

Ever since the double helix was discovered, researchers have worked to decipher the information contained inside DNA.

DNA contains the foundational information needed to sustain life. Understanding how this information is stored and organized has been one of the greatest scientific challenges of the last century. With GROVER, a new large language model trained on human DNA, researchers can now attempt to decode the complex information hidden in our genome. Developed by a team at the Biotechnology Center (BIOTEC) of the Dresden University of Technology, GROVER treats human DNA as a text, learning its rules and context to extract functional information from DNA sequences. This new tool, published in “Nature Machine Intelligence”, has the potential to transform genomics and accelerate personalized medicine.

Ever since the double helix was discovered, researchers have worked to decipher the information contained inside DNA. It is evident, seventy years later, that the information encoded in DNA is complex. The genes, or the sequences that code for proteins, make up only 1% to 2% of the genome.

DNA has many functions beyond coding for proteins. Some sequences regulate genes, others serve structural purposes, most sequences serve multiple functions at once. Currently, we don’t understand the meaning of most of the DNA. When it comes to understanding the non-coding regions of the DNA, it seems that we have only started to scratch the surface. This is where AI and large language models can help.

Dr. Anna Poetsch

DNA as a Language

Large language models such as GPT have transformed our understanding of language. Trained exclusively on text, these models learned to use language in a wide variety of contexts.

DNA is the code of life. Why not treat it like a language?

Dr. Anna Poetsch

The result is GROVER, short for “Genome Rules Obtained via Extracted Representations,” a model that can be used to extract biological meaning from DNA.

GROVER learned the rules of DNA. In terms of language, we are talking about grammar, syntax, and semantics. For DNA this means learning the rules governing the sequences, the order of the nucleotides and sequences, and the meaning of the sequences. Like GPT models learning human languages, GROVER has basically learned how to ‘speak’ DNA.

Dr. Melissa Sanabria

The researchers demonstrated that GROVER can accurately predict the DNA sequence that follows a given stretch and can extract biologically meaningful contextual information, such as the location of gene promoters or protein binding sites on DNA. GROVER also picks up on what are commonly called “epigenetic” processes, that is, regulatory processes that occur on top of the DNA rather than being directly encoded in it.

It is fascinating that by training GROVER with only the DNA sequence, without any annotations of functions, we are actually able to extract information on biological function. To us, it shows that the function, including some of the epigenetic information, is also encoded in the sequence.

Dr. Sanabria

DNA resembles language. It has four letters that build sequences and the sequences carry a meaning. However, unlike a language, DNA has no defined words.

Dr. Anna Poetsch

DNA consists of four letters (A, T, G, and C) and contains genes, but there are no predefined sequences of varying lengths that combine to form genes or other meaningful sequences.

To train GROVER, the team first had to create a DNA dictionary. To do so, they used a technique borrowed from compression algorithms.

This step is crucial and sets our DNA language model apart from the previous attempts.

Dr. Anna Poetsch

We analyzed the whole genome and looked for combinations of letters that occur most often. We started with two letters and went over the DNA, again and again, to build it up to the most common multi-letter combinations. In this way, in about 600 cycles, we have fragmented the DNA into ‘words’ that let GROVER perform the best when it comes to predicting the next sequence.

Dr. Sanabria
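
The procedure Dr. Sanabria describes resembles byte-pair encoding, a technique that originated in data compression: start from single letters, repeatedly find the most frequent adjacent pair, and merge it into a new "word". The following is a minimal sketch under that assumption, not the team's actual code. The function names, the toy sequence, and the cycle count are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most common adjacent token pair, or None if there are no pairs."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return None
    return pairs.most_common(1)[0][0]


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair`, left to right, with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def build_vocabulary(sequence, cycles):
    """Grow a vocabulary of DNA 'words' by merging the most frequent pair once per cycle."""
    tokens = list(sequence)              # start from single nucleotides
    vocabulary = sorted(set(tokens))
    for _ in range(cycles):
        pair = most_frequent_pair(tokens)
        if pair is None:                 # the whole sequence is already one token
            break
        tokens = merge_pair(tokens, pair)
        word = pair[0] + pair[1]
        if word not in vocabulary:       # the same word can arise from different pairs
            vocabulary.append(word)
    return vocabulary, tokens


if __name__ == "__main__":
    # Toy sequence, not real genomic data.
    vocabulary, tokens = build_vocabulary("ATATATGC", cycles=2)
    print(vocabulary)
    print(tokens)
```

On a whole genome the same loop would run for many more cycles (the quote mentions about 600). Per the quote, the number of cycles was chosen by how well the resulting words let the model predict the next sequence, a selection step this sketch does not attempt to reproduce.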


GROVER promises to unlock the different layers of the genetic code. Our DNA encodes important information about what makes us human, our susceptibility to disease, and how we respond to treatments.

We believe that understanding the rules of DNA through a language model is going to help us uncover the depths of biological meaning hidden in the DNA, advancing both genomics and personalized medicine.

Dr. Anna Poetsch

Source: Technische Universität Dresden – News

Journal Reference: Sanabria, Melissa, et al. “DNA Language Model GROVER Learns Sequence Context in the Human Genome.” Nature Machine Intelligence, 2024, pp. 1-13, DOI: https://doi.org/10.1038/s42256-024-00872-0.


Editor's Desk
