Mining DNA for Disease Prediction: The polygenic risk score

by Alex Yenkin
figures by Allie Elchert

Long before the completion of the Human Genome Project, scientists knew that many common diseases had a genetic component. However, there was debate about the architecture of these genetic effects: were there a few high-effect mutations or thousands of tiny-effect mutations spread throughout the genome? Now, in the full swing of the genomics revolution, we can see that those making the latter argument were correct. Though some diseases have simple genetic causes, many of the most widespread diseases, such as cancer, heart disease, and diabetes, are genetically complex. Recently, a tool has emerged to summarize the mutations across the genome that contribute to a disease: the polygenic risk score (PRS). The PRS has been hailed as a potentially huge breakthrough for medical genetics, but it has also been met with appropriate skepticism. To understand why, we need to look at what the PRS is, where it comes from, and what its pitfalls are.

The polygenic risk score is an evolution of medical genetics

Much of the history of medical genetics has been finding one-to-one disease relationships: mutations in the lysosomal gene HEXA cause Tay-Sachs disease, mutations in the BRCA1 gene greatly increase risk for breast and ovarian cancer, and the addition of a DNA repeat at a specific spot in the HTT gene causes Huntington’s disease. In a time when genetic sequencing was arduous and human data were lacking, researchers were able to find these relationships because these mutations’ effects are blunt and visible, making it possible to hunt down the mutation by tracing it through an affected family.

Now, with genomic sequencing becoming more accessible, we can also look for tiny-effect mutations, such as a mutation that makes you 1% more likely to develop breast cancer (unlike the monstrous 300% increase from a BRCA1 mutation). These effects are minuscule when taken individually, but together, they can substantially influence the risk of a condition. This is the essence of the polygenic risk score: combining the effects of many (hence the “poly”) mutations into a single risk score.

Polygenic risk scores are calculated using genome-wide association studies

As with any statistic, it’s important to understand how the PRS is calculated. Its origin is the Genome-Wide Association Study (GWAS, pronounced GEE-wahs), a popular method for looking for genetic associations with a disease. In a GWAS, researchers look within a large population at differences at hundreds of thousands of individual DNA nucleotides, the building blocks of DNA (these variations are called single nucleotide polymorphisms, or SNPs). These SNPs are spread throughout the genome, across all 23 pairs of chromosomes, the structures in the cell that house DNA. For each SNP, researchers look for a correlation between the SNP and the condition at hand, after controlling for many variables, including genetic relationships and the environment.
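To make the idea of a per-SNP correlation concrete, here is a minimal Python sketch of one simple association test: comparing how often an allele appears in people with and without the condition. The SNP names and allele counts are invented for illustration, and the sketch deliberately leaves out the controls for relatedness and environment that a real GWAS applies.

```python
def allelic_association(case_counts, control_counts):
    """Compare allele counts between cases and controls for one SNP.

    Each argument is (risk_allele_count, other_allele_count).
    Returns the odds ratio and the chi-square statistic of the 2x2 table.
    """
    a, b = case_counts
    c, d = control_counts
    odds_ratio = (a * d) / (b * c)
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return odds_ratio, chi2


# Hypothetical allele counts (made up for this example): 1,000 cases and
# 1,000 controls, so 2,000 alleles per group at each SNP.
snp_tables = {
    "rs0001": ((520, 1480), (500, 1500)),  # allele barely more common in cases
    "rs0002": ((700, 1300), (500, 1500)),  # allele clearly more common in cases
}

results = {snp: allelic_association(*table) for snp, table in snp_tables.items()}
for snp, (odds_ratio, chi2) in results.items():
    print(f"{snp}: odds ratio {odds_ratio:.2f}, chi-square {chi2:.1f}")
```

With these made-up counts, the first SNP has an odds ratio of about 1.05 and a chi-square of about 0.5, while the second has an odds ratio of about 1.62 and a chi-square of about 47.6. A GWAS repeats a test like this for every SNP and keeps the ones whose association is too strong to be explained by chance.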

Researchers only have to look at hundreds of thousands of SNPs instead of all 3 billion nucleotides in the genome because of the way DNA is passed down. Over generations, chromosomes exchange parts of themselves in a process called recombination; the way that chromosomes recombine causes regions of DNA close together on the chromosome to stick together (Fig. 1). This means that in GWAS, SNPs are a stand-in for variations in their surrounding regions, similar to someone showing up to a city council meeting claiming to speak for all of her neighbors. After pinpointing which SNP-linked regions are relevant in the GWAS, researchers can then go back and do more careful analysis to find the precise causal mutations.

**Figure 1: Chromosomes recombine across generations.** Over time different versions of the same chromosome in a population will experience recombination and shuffle with one another. The parts of the chromosome closer to a disease-causing mutation are more likely to be consistently the same sequence. In a GWAS, you use SNPs as markers to find correlated regions in the chromosome.

PRS takes all of the associations found in a GWAS and adds up their effects. This sum can add up to a lot! Studies have found that a high PRS for breast cancer can increase risk by as much as a single BRCA1 mutation. Comparable increases in risk for people with high PRS have been found for multiple other diseases.
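The adding-up step can be sketched in a few lines of Python. This is an illustration, not any particular study's method: the SNP names, risk alleles, and odds ratios below are invented, and the sketch assumes the common convention of weighting each risk allele a person carries by the logarithm of its odds ratio.

```python
import math

# Hypothetical GWAS results: SNP id -> (risk allele, odds ratio).
# All values are made up for illustration only.
GWAS_EFFECTS = {
    "rs0001": ("A", 1.01),
    "rs0002": ("T", 1.05),
    "rs0003": ("G", 1.02),
}


def polygenic_risk_score(genotype, effects=GWAS_EFFECTS):
    """Add up log(odds ratio) for every copy of a risk allele carried."""
    score = 0.0
    for snp, (risk_allele, odds_ratio) in effects.items():
        alleles = genotype.get(snp, "")      # e.g. "AT"; a missing SNP adds 0
        dosage = alleles.count(risk_allele)  # 0, 1, or 2 copies
        score += dosage * math.log(odds_ratio)
    return score


# One hypothetical person's genotype at the three SNPs.
person = {"rs0001": "AA", "rs0002": "CT", "rs0003": "CC"}
prs = polygenic_risk_score(person)
print(f"PRS = {prs:.4f}, combined odds ratio = {math.exp(prs):.3f}")
```

Here the person carries two copies of the first risk allele and one of the second, giving a combined odds ratio of about 1.07. With three SNPs the total is small, but a real PRS sums over thousands of them.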

Population structure needs to be taken into account for study accuracy

One of the big caveats with any kind of research into the effects of genetics in a large population is how much it can be affected by population structure, meaning how people in a population are related or tied to common ancestry. In one study looking at a PRS for schizophrenia, the PRS varied greatly between people with European ancestry and people with African ancestry, even more than it varied between people with and without schizophrenia. This result means that if the PRS were put into practice without accounting for this difference, the predictions would be wildly inaccurate.

How can this happen? GWAS depend on genetic correlations, but people who are relatives or come from the same ancestry also have correlated genetics in a way that has nothing to do with disease. The way that GWAS control for this isn’t perfect, and still leaves room for population structure to cause small differences in the calculated effect of a SNP. For example, certain SNPs could be more common in certain parts of the world, or people from the same ancestry could be more likely to share an environment that affects a given disease. Recombination is also a random process, so the genomic region that a SNP is standing in for isn’t an identical region across the world.

For a single site in a GWAS, these differences are likely to be relatively small. However, a PRS adds together the effects of many sites, so those differences are heavily amplified. This can get to the point where a score calculated for someone whose ancestry differs from that of the original GWAS population is essentially incomparable to scores from that population. Currently, most GWAS data are from European populations, so it is unclear how well the specific scores available now apply to other groups of people.
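A rough back-of-the-envelope calculation in Python shows how this amplification can happen. It rests on strong simplifying assumptions that are not from any real data set: every SNP has the same weight, SNPs are inherited independently, and the risk allele is slightly more common in group B than in group A at every site (frequencies of 0.32 versus 0.30), for reasons unrelated to disease.

```python
import math


def standardized_gap(n_snps, freq_a=0.30, freq_b=0.32, weight=0.01):
    """Gap between two groups' average PRS, in units of group A's spread.

    Assumes equal weights, independent SNPs, and a small allele-frequency
    difference pointing the same way at every SNP (all hypothetical).
    """
    # A person carries 2 * frequency copies of the allele on average.
    mean_gap = n_snps * weight * 2 * (freq_b - freq_a)
    # Variance of the allele count at one SNP is 2 * p * (1 - p).
    variance_a = n_snps * weight**2 * 2 * freq_a * (1 - freq_a)
    return mean_gap / math.sqrt(variance_a)


gaps = {n: standardized_gap(n) for n in (1, 100, 10_000)}
for n, gap in gaps.items():
    print(f"{n:>6} SNPs: groups differ by {gap:.2f} standard deviations")
```

Under these assumptions, the two groups differ by about 0.06 standard deviations at a single SNP, which is negligible, but by about 6 standard deviations once 10,000 SNPs are added together. The gap between the groups grows in proportion to the number of SNPs, while the spread within a group grows only with its square root. Real differences between populations will not all point the same way, so this is an upper-end illustration of the mechanism rather than a prediction.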

Many people who study polygenic risk scores dream of giving them a clinical application, that is, using a patient’s genetic data to help understand their individual risk for a disease. This seems promising in the case of certain diseases, like breast cancer and coronary artery disease, which have a stronger genetic component. In these cases, the PRS would be another piece of information your doctor would use when deciding on a medical intervention, such as regular mammograms or a prescription for statins (Fig. 2). On the other hand, some diseases, like bladder cancer or depression, are so much more strongly affected by the environment that using a PRS likely wouldn’t change a doctor’s course of action.

**Figure 2: PRS clinical application.** In this example of how a PRS could be applied clinically, people with a top 5 percentile PRS for coronary artery disease are identified (left). In the clinical decision making process, many risk factors are incorporated, taking into account that those with a high PRS have higher risk, but not all are high risk overall (center). Those above a certain risk threshold will be given a clinical intervention, in this case a prescription for statins. Not everyone with a high PRS will be deemed “high risk”, and some high-risk people can still have a lower PRS (right).

How can we apply polygenic risk scores to individuals?

PRS tests still have a long way to go before they are ready for regular use. Given the data we have now, scientists can’t guarantee that a PRS test would work equally well for all groups of people, which is a serious problem. Even if that were not an issue, the predictive ability of a PRS is measured at the population level, averaging across many people. People with a high PRS for a disease have a higher risk of that disease on average, but an individual with a high PRS can still be at fairly low risk. That means that a PRS should mostly be used together with other clinical and social factors, not as a standalone genetic determinant.
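The sketch below illustrates, in Python, what using a PRS alongside other factors could look like. Everything in it is hypothetical: the risk factors, their weights, the baseline, and the 10% treatment threshold are invented for the example and do not come from any real clinical model.

```python
import math

# Hypothetical log-odds weights for each risk factor (not a real model).
WEIGHTS = {"prs_z": 0.5, "smoker": 0.7, "decades_over_40": 0.4}
BASELINE_LOG_ODDS = -3.5
TREATMENT_THRESHOLD = 0.10  # hypothetical cutoff for offering an intervention


def overall_risk(patient):
    """Combine the PRS with other risk factors into a single probability."""
    log_odds = BASELINE_LOG_ODDS + sum(
        WEIGHTS[factor] * patient[factor] for factor in WEIGHTS
    )
    return 1 / (1 + math.exp(-log_odds))


# prs_z is the PRS in standard deviations from the population average.
high_prs_patient = {"prs_z": 2.0, "smoker": 0, "decades_over_40": 0}
low_prs_patient = {"prs_z": -0.5, "smoker": 1, "decades_over_40": 3}

for name, patient in [("high PRS", high_prs_patient), ("low PRS", low_prs_patient)]:
    risk = overall_risk(patient)
    decision = "intervene" if risk >= TREATMENT_THRESHOLD else "no intervention"
    print(f"{name}: overall risk {risk:.1%} -> {decision}")
```

In this toy model, a young non-smoker with a very high PRS ends up at about 7.6% overall risk, below the threshold, while an older smoker with a below-average PRS ends up at about 13.6%, above it. The PRS shifts the estimate, but it does not decide the outcome on its own.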

Beyond biomedical research and clinical use, the PRS has also become popular in more ethically fraught areas. Several companies have started offering embryonic PRS screening for people undergoing in vitro fertilization, a selection practice that has been heavily criticized for ethical and scientific reasons. A polygenic score can also be found for many other traits, not just diseases; many researchers in biology and other fields have created polygenic scores for much more socially determined traits, like IQ and educational attainment, which has proved controversial.

Despite all this, scientists are still optimistic that we can use the PRS to understand the genetics of disease. There is constant new development in statistical techniques and study designs to tease apart the true genetic signals in GWAS. Many researchers also hope that more studies using diverse data sets will help PRS become more applicable across populations. We are constantly learning just how complex the human genome is, so we’ll need equally complex methods to understand it.

Alex Yenkin is a first-year Ph.D. student in the Bioinformatics and Integrative Genomics program at Harvard Medical School.

Allie Elchert is a third-year Ph.D. candidate in the Biological and Biomedical Sciences program at Harvard Medical School.

Cover image by swiftsciencewriting from Pixabay

For More Information:

If you want a lengthier explanation of the effects of population structure and environment on PRS and GWAS, check out this blog post.
To read more about using polygenic risk scores in a clinical setting, check out this article.
If you want to read more about the effects of population structure on polygenic risk scores, read this study.