Genes or Junk: Measuring the Functional Genome

In 2001, scientists published a (mostly) complete sequence of the human genome, the DNA that’s spread over our 23 chromosomes and contains the information that dictates the function of our cells and the development of our bodies [1].  Over a decade later, the public still wonders what all of that information means.  And so do scientists.

The problem is that it’s far easier to read out the sequence of the genome than to decipher how it all works.  Even counting genes, the functional units of inheritance, isn’t easy.  Prior to publication of the genome sequence, scientists believed we had around 100,000 genes.  This count has since been reduced several times, and is currently holding at nearly a fifth of that number [2].  And gene counts, while one of the easiest parameters to measure, are just one part of the puzzle.

A gene is typically defined as a portion of DNA that encodes a protein.  Some of your DNA codes for protein sequences directly, and biologists have a pretty good handle on how to identify those coding regions: we have roughly 20,000 of them, taking up around 1% of our genome [3].  Proteins control what your cells do, and your cells differ in both form and function because of variation in which proteins they make.  But your cells all have pretty much the same genome, so what makes a heart cell different from a nerve cell is the subset of genes that each chooses to use, or “express”, at any given time.  So, in addition to the coding sequences, we count as important the portions of DNA that help determine which protein-coding sequences get activated, and when.  Such portions of DNA are called regulatory sequences [Figure 1].

Figure 1 ~ A schematic summary of a chromosome.  Sequences that are under selection, including coding and regulatory sequences, are highlighted in blue.  Sequences that are not under selection, which may be junk DNA, are highlighted in green.  Scale of the indicated features changes throughout; scale bars are in black.

Identifying regulatory sequences is trickier than finding coding sequences in our DNA, but some such regions are relatively well-known.  Recent reports that include these sequences increase the part of our genome that appears active to 2-3%.  That’s better than just knowing that 1% codes for protein, but it still leaves at least 97% unexplained.  What does the rest do?

We can get more help by evaluating the genome through the lens of evolutionary biology.  If some parts of our genome are more important than others, mutations in those parts are more likely to be harmful and to be weeded out, so the important sequences show up largely unchanged in our descendants; trivial portions will accumulate mutations and drift toward randomness.  Since all life on Earth came from a common ancestor long ago, we can get an idea of what’s important by looking at what was kept, or “selected for,” in other organisms, which took different evolutionary paths.  For the human genome, other mammals make a good comparison.  The first two mammals to have their genomes sequenced were mice and humans; they showed close agreement for 5% of their sequences, so evolution is hinting that at least that much of our genome is likely important [4].
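To make the idea of “looking at what was kept” concrete, here is a minimal Python sketch that slides along two already-aligned sequences and flags the windows where they still agree closely.  Everything in it is an assumption for illustration: the toy “human” and “mouse” sequences are made up, and the window size of 10 and the 90% cutoff are arbitrary.  Real comparative genomics works on whole-genome alignments and uses statistical models of how quickly unselected sequence is expected to change, not a simple threshold.

```python
def percent_identity(a, b):
    """Fraction of aligned positions that match, ignoring gap characters."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def conserved_windows(a, b, window=10, threshold=0.9):
    """Return (start, identity) for each non-overlapping window at or above threshold."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    hits = []
    for start in range(0, len(a) - window + 1, window):
        identity = percent_identity(a[start:start + window], b[start:start + window])
        if identity >= threshold:
            hits.append((start, identity))
    return hits


# Made-up aligned sequences: the first and last blocks are nearly identical,
# while the middle block has diverged completely.
human = "ATGGCCAAGT" + "TTACGGATCA" + "GGCTTAACCG"
mouse = "ATGGCCAAGT" + "CAGTTCGAAC" + "GGCTTAACCA"

for start, identity in conserved_windows(human, mouse):
    print(f"window starting at {start}: {identity:.0%} identical")
```

In this toy case the first and third windows are flagged as candidates for being under selection, while the scrambled middle window is not.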

In contrast, our closest living relatives, the chimpanzees and bonobos, have genomes that are between 96% and 99% similar to ours.  Does that mean that fully 96% of our genome is functional?  Maybe not.  The last common ancestors of humans and chimps lived around 5 million years ago (as opposed to roughly 75 million years for humans and mice).  Evolution proceeds at a finite rate, and sufficient time may not have passed since the divergence of humans and other primates to distinguish crucial regions of our DNA from those that are slowly drifting into randomness [5].
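A rough back-of-the-envelope calculation shows why time matters here.  The sketch below estimates how similar two sequences would remain if no selection were acting on them at all, using the Jukes-Cantor model, a standard textbook simplification in which every base is equally likely to change into any other.  The mutation rate of 2.2e-9 changes per site per year is an illustrative assumption, not a figure from the studies cited here; real rates differ between lineages, so the outputs should be read as orders of magnitude only.

```python
import math


def expected_identity(years_since_split, rate_per_site_per_year=2.2e-9):
    """Expected fraction of identical sites between two lineages under neutral drift.

    Assumes the Jukes-Cantor model and a single, constant, purely illustrative
    substitution rate shared by both lineages.
    """
    # Changes accumulate independently along both branches since the split.
    substitutions_per_site = 2 * rate_per_site_per_year * years_since_split
    # Jukes-Cantor correction: some sites change more than once or change back.
    fraction_different = 0.75 * (1 - math.exp(-4 * substitutions_per_site / 3))
    return 1 - fraction_different


# Split times taken from the text: about 5 million years for human and chimp,
# roughly 75 million years for human and mouse.
for label, years in [("human vs. chimp", 5e6), ("human vs. mouse", 75e6)]:
    print(f"{label}: about {expected_identity(years):.0%} identical with no selection")
```

Under these assumptions, even completely unselected DNA would still look almost identical between humans and chimps, which is why high similarity alone says little about function.  Over the much longer human-mouse timescale, unselected sequence diverges far more, so the stretches that still agree stand out.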

A more balanced approach would be to compare the human genome to a range of different mammals, identifying sequences that are similar or different between both close and distant relatives.  This is the method used in a recent paper from researchers at Oxford University in the United Kingdom [6].  In the past decade, sequencing genomes has become much faster and cheaper, and we have obtained the sequences of many more of our mammalian brethren.  On the basis of multiple comparisons between humans and other available mammalian genomes, the new results suggest that about 8.2% of the human genome has been subject to recent selection and is therefore functional.

This approach is not free from difficulty, however.  Many challenges occur during the analysis and modeling steps, and these issues could well influence the amount of sequence that’s estimated to be under selection.  The authors have confidence that their method is working, though: they point out that the evolutionarily selected sequences identified by their computational approach fall more often within protein-coding genes than in presumed regulatory or non-coding DNA, as expected.

Another, broader question that persists in biology is whether large swaths of anonymous sequence could play a role in our bodies that is independent of their exact sequence.  While we may only share 8% of our genomes with the average mammal, it is striking that many other animals are similarly carting around long stretches of DNA with no known function, and that these stretches are largely unique to each animal lineage.  In comparison, single-celled organisms like bacteria and yeast have far less sequence about which biologists are puzzled: do big bodies benefit from having big genomes?  Many researchers regard the unstructured muddle of most of our genomes as true junk, but others suggest this sequence may have a role we simply don’t fully appreciate yet.

Some of the junk really does appear to be just that.  Our genomes contain the scars of ancient retroviral infections.  We also have “jumping genes” with no function other than to copy themselves into new locations in the genome, as well as enormous repetitive sequences that grow even longer when the genome replication machinery slips up and accidentally makes extra repeats [7].  These sequences may stick with us, not because they benefit us, but because they are so good at copying themselves, perhaps at our expense.  But then again, it’s interesting that a similar ratio of genes and junk predominates in other mammals, where the genes are familiar but the junk is very different.  Why do we all keep so much apparently useless DNA around?  Are we all victims of genomic parasites, or have our bodies co-opted unstructured DNA for a purpose that’s currently too clever for our minds to grasp?

While we have a nearly complete sequence for the human genome, we still don’t fully understand our genes and proteins.  We can’t predict the precise structure or function of proteins, when and where they will be made, or how they are affected by pathogens and drugs.  Much of the information dictating which tissues express which genes, and when, is locked up in the less well-understood noncoding regions of our DNA.  We need a way to differentiate these from the other debris that appears to make up the bulk of our genomes.  Importantly, this new study predicts exactly where many of those important sequences are, marking them for further scrutiny.  If we can focus our attention on the subset of our genetic material that is being acted upon by evolution, we may move more quickly towards a comprehensive understanding of what’s inside us, and how to treat it when it fails.

Drew MacKellar is a postdoctoral fellow in the Systems Biology Department at Harvard Medical School.

References

[1] A guide to your genome.  National Human Genome Research Institute.  http://www.genome.gov/Pages/Education/AllAbouttheHumanGenomeProject/GuidetoYourGenome07_vs2.pdf

[2] Human Genome Shrinks To Only 19,000 genes.  Physics arXiv blog.  https://medium.com/the-physics-arxiv-blog/human-genome-shrinks-to-only-19-000-genes-21e2d4d5017e

[3] DNA Molecule: How Much DNA Codes for Protein?  DNA Learning Center.  http://www.dnalc.org/resources/3d/09-how-much-dna-codes-for-protein.html

[4] Waterston, R. H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J. F., Agarwal, P., … Mouse Genome Sequencing Consortium. (2002). Initial sequencing and comparative analysis of the mouse genome. Nature, 420(6915), 520–62.

[5] Neutral theory of molecular evolution.  Wikipedia.  http://en.wikipedia.org/wiki/Neutral_theory_of_molecular_evolution

[6] Rands, C. M., Meader, S., Ponting, C. P., & Lunter, G. (2014). 8.2% of the Human Genome Is Constrained: Variation in Rates of Turnover across Functional Element Classes in the Human Lineage. PLoS Genetics, 10(7), e1004525. doi:10.1371/journal.pgen.1004525

[7] Transposons and repetitive elements in the human genome.  DNA Learning Center.  http://www.dnalc.org/view/15308-Transposons-and-repetitive-elements-in-the-human-genome-Jim-Kent.html