A pangenomic approach allows for a more comprehensive characterization of genetic variations and can improve the analyses used by a wide range of researchers and clinicians. Credit: Elena.
Since the first human genome was sequenced more than 20 years ago, the study of human genomes has relied on a single reference genome to identify genetic variations. A single reference genome cannot represent human diversity, however, and relying on it introduces bias into studies. Researchers now have a practical alternative.
In a paper published in December in Science, researchers at the UC Santa Cruz Genomics Institute introduced Giraffe, a tool that can efficiently map new genome sequences to a "pangenome" representing many diverse human genome sequences. The researchers show that this approach allows a more comprehensive characterization of genetic variations and can improve the analyses used by a wide range of researchers and clinicians.
Benedict Paten, associate professor of biomolecular engineering at UC Santa Cruz and director of the Genomics Institute, said that the team has been working toward this for years, and now for the first time has something practical that works fast and works better than the single reference genome. It is important for the future of biomedicine that everyone is treated equally, he said, so researchers need tools that account for the diversity of human populations.
Human genomes vary in many ways, both in the exact sequence of the genes and in the vast stretches of the genome outside of them. A difference in a single letter of code is known as a single nucleotide variant (SNV), and a short insertion or deletion is known as an indel.
Structural variants, which involve large segments of code, are the most complex. They are difficult to find, yet they can have significant effects and are known to play an important role in some diseases. The average person has millions of SNVs and indels and tens of thousands of larger structural variants, and collectively the structural variants account for more letters of code than the other types of variants do.
Paten said that SNVs and short indels have been the workhorses of genomics. Pangenomics, he said, makes it possible to study structural variants the same way. Many structural variants can have a big impact on the genetics of disease.
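To make the three categories concrete, here is a minimal Python sketch that labels a variant by comparing its reference and alternate alleles. The function name `classify_variant` and the 50-letter cutoff for structural variants are assumptions made for illustration (the cutoff is a common convention, not a figure taken from this article).

```python
def classify_variant(ref: str, alt: str, sv_min_len: int = 50) -> str:
    """Label a variant by the lengths of its reference and alternate alleles.

    sv_min_len is an assumed cutoff (a common convention); anything at least
    this long is treated as a structural variant.
    """
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"  # a single letter of code differs
    if max(len(ref), len(alt)) >= sv_min_len:
        return "structural variant"  # a large segment of code is affected
    # Simplification: short multi-letter substitutions are lumped in here too.
    return "indel"


examples = [
    ("A", "G"),             # single-letter change
    ("A", "ATTG"),          # three letters inserted
    ("A", "A" + "T" * 60),  # sixty letters inserted
]
labels = [classify_variant(ref, alt) for ref, alt in examples]
print(labels)
```

Real variant callers use richer definitions (inversions and duplications, for example, are not captured by allele length alone), so this is only a rough sorting rule.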
A pangenome can be represented as a mathematical graph structure that captures the relationships between different genomes. In the new paper, the researchers built two human genome reference graphs and used them to evaluate the new tool, Giraffe, a set of algorithms for mapping new sequence data to a pangenome reference.
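As a rough illustration of the graph idea, the Python sketch below stores a tiny pangenome as sequence-labeled nodes, directed edges, and one path per genome. The node sequences, path names, and the helper `spell` are hypothetical; this is not the data structure Giraffe itself uses.

```python
# Each node holds a stretch of sequence; edges say which nodes may follow which.
nodes = {1: "GATT", 2: "A", 3: "C", 4: "ACAT"}
edges = {(1, 2), (1, 3), (2, 4), (3, 4)}

# Each genome embedded in the graph is a walk through the nodes.
# Nodes 2 and 3 form a "bubble": two versions of the same position.
paths = {
    "reference": [1, 2, 4],
    "sample_1": [1, 3, 4],
}


def spell(path: list[int]) -> str:
    """Return the sequence spelled by a path, checking that every step is an edge."""
    for a, b in zip(path, path[1:]):
        if (a, b) not in edges:
            raise ValueError(f"no edge from node {a} to node {b}")
    return "".join(nodes[n] for n in path)


sequences = {name: spell(path) for name, path in paths.items()}
print(sequences)
```

Both genomes share nodes 1 and 4, so the graph stores the common sequence once and records only the difference between them as alternative branches.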
Many of Giraffe's key algorithmic innovations were pioneered by first author Jouni Sirén. Giraffe can accurately map new sequence data to thousands of genomes embedded in a pangenome reference as quickly as existing tools map to a single reference genome. The study shows that using Giraffe reduces mapping bias, the tendency to mismap sequences that differ from the reference genome.
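Mapping bias can be illustrated with a toy Python example. It scores a read by counting mismatches at every ungapped position, first against a single reference and then against every genome in a two-genome pangenome. The sequences, the helper `best_mismatches`, and the `MAX_MISMATCHES` threshold are invented for illustration; Giraffe's actual algorithms are far more sophisticated.

```python
def best_mismatches(read: str, haplotype: str) -> int:
    """Fewest mismatches over all ungapped placements of the read on the haplotype."""
    return min(
        sum(a != b for a, b in zip(read, haplotype[i:i + len(read)]))
        for i in range(len(haplotype) - len(read) + 1)
    )


reference = "ACGTTGCAAGCT"
alternate = "ACGTTCGTAGCT"  # differs from the reference at three adjacent letters
read = alternate[2:10]      # a short read sampled from the alternate genome
MAX_MISMATCHES = 2          # assumed cutoff: reads with more mismatches go unmapped

linear_score = best_mismatches(read, reference)
pangenome_score = min(best_mismatches(read, h) for h in (reference, alternate))

maps_to_linear = linear_score <= MAX_MISMATCHES
maps_to_pangenome = pangenome_score <= MAX_MISMATCHES
print(linear_score, pangenome_score, maps_to_linear, maps_to_pangenome)
```

In this toy case the read is discarded when only the single reference is available, but it maps perfectly once the genome it came from is part of the reference.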
The pangenome reference graphs were constructed using variant calls from long read–based and large-scale sequencing studies. The mapping accuracy, allele coverage balance, and speed of Giraffe and competing mappers were evaluated. Variant call accuracy was then evaluated using the mapped reads. Structural variant calls and expression data were used to identify eQTLs. Credit: Sirén et al., Science 2021.
"Not only is the analysis better, it is also as fast as current methods that use a linear reference genome," said co-first author Jean Monlong.
Inexpensive short-read sequencing is a mainstay of modern genomics, yielding snippets of sequence that must be mapped to a reference genome to make sense of them. Genotyping is the process of identifying the variants present at each location in an individual's genome.
The researchers found that DeepVariant could identify SNVs and indels more accurately using Giraffe's alignments against a pangenome than it could using alignments against a single reference genome.
Monlong said he was most excited about using pangenomics to study structural variants.
He said that many structural variants have been discovered. With pangenomes, researchers can look for these structural variants in large datasets of short-read sequencing. This will allow them to study the new structural variants across many people and ask questions about their functional impact, association with disease, or role in evolution.
The researchers used Giraffe to map sequence reads from a diverse group of 5,202 people and determine their genotypes for 167,000 recently discovered structural variants. They were able to estimate the frequencies of different versions of the structural variants in the human population as a whole and within individual subpopulations. They showed that the frequencies of some variants differ between subpopulations and that these variants can be misinterpreted in European-ancestry populations, where the frequency of a particular variant is low.
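The frequency estimate described here reduces to simple counting once genotypes are known. The Python sketch below assumes each person's genotype is recorded as the number of copies of the variant they carry (0, 1, or 2). The sample names, group labels, and genotype values are made up for illustration and are not data from the study.

```python
from collections import defaultdict


def allele_frequencies(genotypes: dict[str, int], groups: dict[str, str]) -> dict[str, float]:
    """Variant frequency per group: copies observed / chromosomes examined."""
    copies = defaultdict(int)
    chromosomes = defaultdict(int)
    for sample, copy_count in genotypes.items():
        group = groups[sample]
        copies[group] += copy_count
        chromosomes[group] += 2  # each person carries two copies of the site
    return {group: copies[group] / chromosomes[group] for group in chromosomes}


# Hypothetical genotypes for one structural variant in six people.
genotypes = {"s1": 0, "s2": 0, "s3": 1, "s4": 2, "s5": 1, "s6": 2}
groups = {"s1": "group_A", "s2": "group_A", "s3": "group_A",
          "s4": "group_B", "s5": "group_B", "s6": "group_B"}

per_group = allele_frequencies(genotypes, groups)
overall = sum(genotypes.values()) / (2 * len(genotypes))
print(per_group, overall)
```

In this invented example the variant is rare in one group and common in the other, so a study that sampled only the first group would badly underestimate how widespread it is.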
A single reference genome contains only one version of each sequence, so the other versions are unrepresented. By making more broadly representative pangenome references practical, Giraffe can make genomics more inclusive.
The National Human Genome Research Institute is funding a project at the UC Santa Cruz Genomics Institute to build a comprehensive human pangenome reference, which they expect to release next year.
The new paper has three other co-first authors who contributed equally: Xian Chang, Adam Novak, and Jordan Eizenga, all at the UC Santa Cruz Genomics Institute. Other coauthors include Director David Haussler of the Genomics Institute, as well as researchers at the Broad Institute of MIT and Harvard, University of Michigan, Harbor-UCLA Medical Center, and University of Tennessee Health Science Center.
More information: Sirén et al., "Pangenomics enables genotyping of known structural variants in 5,202 diverse genomes," Science (2021). www.science.org/doi/10.1126/science.abg8871
Journal information: Science
A new way to find genetic variations removes bias from genotyping.