How to read a genome
February 17, 2011
What makes you the unique human being you are? Partly it’s nurture (what your mother ate while she was pregnant with you, whether she smoked, how much you exercise, which drugs you take) and partly it’s nature. The part that’s nature is sometimes clear-cut (if your biological father and mother both had the O negative blood type, you do too) and sometimes not. If your mother is tall and your father short, I can’t make any kind of confident prediction about how tall you are, even leaving aside the effects of nutrition. Height is a “quantitative trait”, a quality that is largely inherited but not controlled by a single gene. Unlike Mendel’s peas, which showed a digital (or qualitative) phenotype, either wrinkled or smooth, quantitative traits are analog, showing a smoothly variable degree of (say) wrinkled-ness. Such characteristics, like skin color or risk of adult-onset diabetes, are presumptively controlled by polymorphisms in multiple genes or loci, each making a small or medium-sized contribution to the eventual outcome. Genome-wide association studies (GWAS) aim to identify these loci and determine how much of a contribution they make to the trait. There’s currently a furious debate in the field over two issues: “missing heritability” (the gap between the expected genetic contribution to a trait and the sum of the genetic contributions identified by GWAS), and the question of how many of the polymorphisms contributing to disease are likely to be common (possible to identify via GWAS) and how many are likely to be rare.
But let’s leave missing heritability aside and look at a different GWAS-related question. Suppose you have found a locus that’s clearly associated with a given trait, for example the risk of developing a disease. You have a list of genetic polymorphisms at that locus that are positively or negatively associated with the disease in the population you studied. The next problem you face is that not all of these associations are going to be real: some polymorphisms will be causal, changing either the sequence or the expression level of the protein or RNA that mediates the increase in disease risk, and others will only appear to be associated with the phenotype because they’re correlated with the causal ones in the population you happened to study. If we want to be able to read an individual genome and determine whether the person who owns the genome has an elevated risk of a disease, we need to know which polymorphisms are really driving the behavior of the trait. And it would be nice if there were only a few important ones, with the majority of the list showing up because of correlations.
Angela DePace pointed out an interesting recent paper that takes on this challenge for a quantitative trait in Drosophila, the sex-specific color patterns on the abdomens of adult females (Bickel et al. 2011, Composite effects of polymorphisms near multiple regulatory elements create a major-effect QTL. PLoS Genetics 7, e1001275). This particular trait is known to have a strong association with the bric a brac locus, so the authors took 96 D. melanogaster lines that vary in color pattern, sequenced the bric a brac locus for each line, and set out to figure out how the variation at the locus led to the variation in color pattern.
The bric a brac locus is ~150kb long, and the number of polymorphisms the authors found in it in these 96 lines is large, around 7,000 in total. Thus, this locus offers a reasonable facsimile of the kind of situation we’re likely to face in the human genome once an association has been identified via GWAS. Bickel et al. discarded the rarest polymorphisms, since it’s difficult to use association studies to correlate these with phenotypic variation (of course, this doesn’t mean that they’re not important; see GWAS controversy above). Of the remaining ~4,000 polymorphisms, about 250 turned out to have a significant association with color variation, all of which were found in the non-coding regions of the locus.
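For readers who like to see the filtering step spelled out, here is a minimal sketch of what "discarding the rarest polymorphisms" could look like. The data layout, the function names and the cutoff are all my own assumptions for illustration; the post does not say what threshold Bickel et al. actually used.

```python
from collections import Counter


def minor_allele_frequency(alleles):
    """Frequency of the second most common allele at one site.

    `alleles` holds one allele per fly line (e.g. "A" or "G").
    A site where every line carries the same allele returns 0.0.
    """
    counts = Counter(alleles).most_common()
    if len(counts) < 2:
        return 0.0
    return counts[1][1] / len(alleles)


def common_sites(genotypes, min_maf):
    """Keep only the positions whose minor allele frequency is >= min_maf.

    `genotypes` maps a position in the locus to its list of alleles.
    `min_maf` is a hypothetical cutoff, not the one used in the paper.
    """
    return {
        pos: alleles
        for pos, alleles in genotypes.items()
        if minor_allele_frequency(alleles) >= min_maf
    }


# Toy data for 20 imaginary lines (made up, not from the paper).
toy_genotypes = {
    101: ["A"] * 19 + ["G"],      # minor allele in 1 of 20 lines
    202: ["C"] * 12 + ["T"] * 8,  # minor allele in 8 of 20 lines
    303: ["G"] * 20,              # no variation at all
}
kept = common_sites(toy_genotypes, min_maf=0.1)
print(sorted(kept))
```

With a cutoff of 0.1, only the site at position 202 survives. The rare variant at 101 is dropped even though it might matter biologically, which is exactly the caveat raised above.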
How many of these are causal, and how many are just along for the ride? The authors have previously shown that this particular locus has a rather low level of linkage disequilibrium, the tendency for polymorphisms in one part of the locus to drag other polymorphisms along with them. This makes it relatively easy to ask who’s driving. Based on the pattern of linkage, Bickel et al. estimate that there must be at least 50 polymorphisms (or groups of polymorphisms) that are independently associated with the color variation and therefore likely to be causal. So the good news is that you don’t have to worry about all 7,000 polymorphisms; the bad news is that there are probably at least 50 important ones, not just a handful as we might have hoped.
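To make "linkage disequilibrium" a little more concrete, here is a small sketch of one standard way to measure it between two sites, the r² statistic. I am assuming each fly line carries a single allele per site (coded 0 or 1), which is a simplification, and I am not claiming this is the calculation the authors ran.

```python
def ld_r_squared(site_a, site_b):
    """r^2 between two biallelic sites, one 0/1 allele code per line.

    D = P(both 1) - P(a = 1) * P(b = 1), and r^2 is D squared divided by
    the product of the allele-frequency variances at the two sites.
    """
    n = len(site_a)
    if n == 0 or n != len(site_b):
        raise ValueError("sites must be non-empty and of equal length")
    p_a = sum(site_a) / n
    p_b = sum(site_b) / n
    p_ab = sum(a * b for a, b in zip(site_a, site_b)) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        # One site does not vary, so r^2 is undefined; report 0.0 here.
        return 0.0
    d = p_ab - p_a * p_b
    return d * d / denom


# Toy alleles for four imaginary lines (made up for illustration).
driver = [0, 0, 1, 1]
passenger = [0, 0, 1, 1]   # always travels with the driver
bystander = [0, 1, 0, 1]   # independent of the driver

print(ld_r_squared(driver, passenger))  # perfectly linked
print(ld_r_squared(driver, bystander))  # unlinked
```

When r² between two sites is close to 1, an association study cannot tell the driver from the passenger. Low r² across a locus is what lets you count independent signals.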
The polymorphisms likely to be important fall into three regions: around the cis-regulatory element (CRE) that controls sex-specific expression of the two genes in the locus, bab1 and bab2; around a polycomb response element (PRE) between the two genes; and around the promoter and transcription start site of bab2. Each of these makes sense, in that it’s easy to see how changes in these areas would affect the level of expression of one or more of the genes: the CRE directly drives expression, the PRE affects chromatin state and thus indirectly affects expression levels, and promoter structure can affect the rate of initiation. At the same time, though, most of the polymorphisms aren’t actually in the relevant functional element; they’re just close to it. You might have guessed that changing transcription factor binding affinities might be a pretty common way of altering transcriptional activity, for example, but actually none of the common polymorphisms around the CRE falls within a transcription factor binding site. So clearly we still have a lot to learn about how DNA sequence controls transcription.
The other slightly depressing point is that each individual polymorphism appears to explain only a small part (~1%) of the overall effect. Altogether, the set of polymorphisms identified as significant explains ~84% of the variation in color. So, though this locus is very important in determining the color pattern in the individual fly, no individual polymorphism has a major effect. We’re going to need to look at all 50 to be able to make a strong prediction.
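Here is a back-of-the-envelope sketch of why many small effects force you to look at the whole set. It assumes a purely additive model with independent sites and one allele per line, and every number in it is invented for illustration, not taken from the paper.

```python
def fraction_of_variance(freq, effect, total_variance):
    """Share of trait variance explained by one site.

    Under the additive, one-allele-per-line assumption the site adds
    `effect` to the trait in a fraction `freq` of lines, so its variance
    contribution is freq * (1 - freq) * effect**2.
    """
    return freq * (1 - freq) * effect ** 2 / total_variance


# Hypothetical numbers: 50 independent sites, each at 50% frequency,
# each shifting the trait by 0.2 units, in a trait with variance 1.0.
sites = [(0.5, 0.2)] * 50
per_site = [fraction_of_variance(f, e, total_variance=1.0) for f, e in sites]

print(round(per_site[0], 4))    # share explained by a single site
print(round(sum(per_site), 4))  # share explained by all 50 together
```

In this toy setup each site explains about 1% of the variance, and only the sum across all 50 becomes large enough to support a real prediction. Correlated sites or interactions between sites would break the simple addition.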
Bickel et al. were able to show that about 10 of their polymorphisms were associated with changes in transcription of bab2 at a specific developmental stage. For these polymorphisms, we now have a straightforward mechanistic hypothesis: they cause an increase or decrease in the production of bab2 protein, which acts as a transcription factor, causing changes in expression of other proteins. With a little more work (well, a lot more work) on these polymorphisms, we might be able to get to the point where we can read off DNA-level changes and make a confident prediction of the (small) downstream effect on pigmentation.
This leaves the effect of at least 40 other independent polymorphisms unaccounted for, however; perhaps they affect transcription at a different developmental stage. Another possibility is that the effect of these polymorphisms on transcription is very specific to a subset of cells (perhaps a subset of cells that expresses a specific co-activator or repressor), and is therefore hard to see when mashing up whole fly abdomens. Bottom line: even quantitative effects that are largely explained by a single locus can be really complicated… and even for well-studied traits we probably have only a small part of the story.
Usually when I’m on the verge of getting depressed by the sheer complexity of biology I try to think of it as job security; but unfortunately I’ve been reading too much political news to feel that way just now. Nevertheless, looking firmly on the bright side, as Ben Simons points out it’s not all that likely that we’ll run out of interesting questions in biology (unlike physics) any time soon.
Bickel RD, Kopp A, & Nuzhdin SV (2011). Composite effects of polymorphisms near multiple regulatory elements create a major-effect QTL. PLoS Genetics, 7 (1), e1001275. PMID: 21249179