## Using models to link different varieties of data

September 3, 2010

One of the fundamental problems in systems biology is that many important decisions get made at the level of individual cells, but most of the measurement techniques we have (mass spectrometry, Western blots…) report data on the behavior of populations. And much of the time it’s hard to relate what you see on the single-cell level to what you see at the population level. A recent paper (Pfeifer et al. 2010 Model-based extension of high-throughput to high-content data. *BMC Syst Biol.* **4** 106. PMID: 20687942) aims to address this issue to some extent, by providing a method for comparing and merging data from microscopy and FACS analysis.

Even if this paper hadn’t said anything else interesting, I would still want to cover it because of the first sentence of the abstract, which states:

“High-quality quantitative data is a major limitation in systems biology.”

It’s so refreshing, given the dominance of the meme that biologists are drowning in floods of data, to see an acknowledgement that often the reverse is the case: we’re starving [I guess that should be thirsting, but it doesn’t have quite the same resonance] for the right data. This is why much of this blog is about new ways to measure things, quantitatively, so that we can get a quantitative description of how the biological circuits we aim to understand are behaving. It’s true that there’s a lot of low-quality (by which I don’t mean carelessly done, but low information content) data about; many brave souls are addressing the problem of extracting meaningful information from these large datasets. But in case you were in any doubt, no, this is not the time for experimentalists to down tools and retrain as computational biologists. The ability to make quantitative measurements is still a limiting factor in understanding biological systems.

Let’s move on to sentence #2. The overall idea of this paper is as follows: if you can measure a given quantity (call it Y), say cell volume, using two different methods, and if the process you’re interested in (call it X) depends on Y, then you can relate the measurements made by the two methods. If one of the methods (in this case microscopy) provides the ability to measure both X and Y and how they relate in single cells, and the other (in this case FACS) provides statistics on how Y varies across a population, then you can measure X at the single-cell level and use a conversion via Y to estimate what’s happening to X at the population level. Then you have a new way to use the other population-level tools you have available, such as measurements of the phosphorylation state of proteins involved in the process, to understand how the data measured with these other tools relate to the behavior of your process. This is not quite a complete story at this point, but the thinking behind the paper is interesting; and since the Sorger lab has been struggling with the same problem for a while, some of you may even have ideas about how to take the story further.

To relate the two sets of measurements, you first measure the same cell population using both methods (microscopy and FACS). The Y quantities — the quantities we will use for the conversion — in this case are cell volume and the concentration of a fluorescent marker, and what’s most important is the distribution of each value. The authors show that you can plot the quantile distributions from the two different methods against each other and get a conversion factor that allows you to directly compare the microscopic measurement to the FACS measurement. In essence you are taking advantage of the variability of the quantity measured to create a conversion factor.
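The quantile-matching idea can be sketched in a few lines of numpy. This is a toy illustration with synthetic data, not the authors’ analysis: I simulate one underlying quantity reported in two arbitrary unit systems, pair up the quantiles of the two empirical distributions, and read the conversion factor off the slope of the Q-Q line.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the same underlying quantity (e.g. fluorescence)
# reported in two different, arbitrary unit systems: "microscopy" in one
# scale, "FACS" in another that differs by an unknown factor.
true_values = rng.lognormal(mean=1.0, sigma=0.4, size=5000)
microscopy = 2.5 * true_values + rng.normal(0, 0.1, size=5000)   # unit A
facs = 40.0 * true_values + rng.normal(0, 1.0, size=5000)        # unit B

# Quantile-quantile comparison: match the two empirical distributions
# quantile by quantile, then fit a straight line through the paired quantiles.
q = np.linspace(0.01, 0.99, 99)
qq_micro = np.quantile(microscopy, q)
qq_facs = np.quantile(facs, q)

# The slope of the Q-Q plot is the conversion factor between the two units
# (here the true factor is 40 / 2.5 = 16).
slope, intercept = np.polyfit(qq_micro, qq_facs, deg=1)
print(f"conversion factor (FACS units per microscopy unit): {slope:.2f}")
```

Note that this only works because the quantity varies across the population: if every cell had the same volume, the Q-Q plot would collapse to a single point and there would be no line to fit.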

[A word of warning: this paper is much easier to understand if you read the HTML version instead of downloading it as a PDF. This is because there are multiple errors in the PDF version, including missing out equation (2) and rendering m’ and d’ as m and d throughout. Thus the slope of the quantile-quantile plot is given as m/m — which even I can see is not particularly helpful. I would say that the publishers should be ashamed of themselves, but since this is an open-access paper and the publishers have charged what the market will bear for the author fee — alas, not enough to pay for decent subediting — I guess this is the kind of problem we will all have to get used to.]

Pfeifer et al.’s goal was to apply this conversion approach to the problem of what controls the nucleocytoplasmic cycling of the transcription factor STAT5B. It’s assumed to be imported through nuclear pores via a so-far-unidentified importin factor, and exported, again through nuclear pores, in a process mediated by the exportin CRM1. The balance between import and export is believed to be regulated by phosphorylation. The dynamics of this process are hard to measure: you can look at the process of cycling itself by microscopy, using fluorescence recovery after photobleaching (FRAP), but FRAP data can only be readily interpreted if the system is close to steady state. The system is only at or near steady state in resting cells, and so the microscopy measurement is limited to the behavior of non-phosphorylated STAT5B. The goal of this paper, then, is to measure import and export of unphosphorylated STAT5B at steady state, which can then be used as a fixed value in a larger pathway model. To be useful in the larger model, the steady-state behavior must be relatable to population information. This can be done if (1) cell volume and concentration of STAT5B (measured using a GFP fusion) can be measured on the same population by FACS and microscopy; and (2) if the import/export dynamics depend on cell volume and concentration of STAT5B.

It’s reasonable, of course, that dynamics might depend on STAT5B concentrations. Indeed, the authors show that it does; import can be modeled as a saturable process, increasing linearly at low STAT5B-GFP concentrations (which are controlled by a Tet-inducible promoter) and saturating at high STAT5B-GFP concentrations. Why should the dynamics depend on volume? First, be aware that the measurement being used is transport “current”, i.e. the number of molecules moving from nucleus to cytoplasm, or vice versa, per second. This number presumably depends on at least a couple of parameters: the number of nuclear pores, and how many of the relevant import/export factors are available. Pfeifer et al. assume that the availability of import factors is proportional to the size of the cytoplasm, and that the availability of export factors is proportional to the size of the nucleus. Now, this is clearly not exactly true on the single-cell level since the concentrations of proteins in individual cells vary. But how important this variability is isn’t clear, so let’s follow the argument and see where it leads us.
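The saturable behavior described above is the familiar Michaelis-Menten-style shape. As a minimal sketch (the parameter values are illustrative, not fitted values from the paper):

```python
def import_rate(conc, j_max=100.0, k=2.0):
    """Saturable import current: roughly linear in conc when conc << k,
    approaching j_max when conc >> k. j_max and k are arbitrary here."""
    return j_max * conc / (k + conc)

# Near-linear regime at low concentration vs. saturation at high concentration.
print(import_rate(0.01), import_rate(1000.0))
```

Doubling the concentration from 0.01 to 0.02 nearly doubles the rate, while pushing it far above k barely moves the output — which is the qualitative signature the authors report for STAT5B-GFP import.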

Pfeifer et al. next ask how well their data fit to each of three hypotheses: the largest factor in determining the current is (1) the surface area of the nucleus (as a proxy for the number of nuclear pores); (2) the cytoplasmic volume (representing the number of importins) or (3) the nuclear volume (representing the number of exportins). They find an extremely good fit between hypothesis (2) and the data on the import current. For the export current they can’t distinguish between hypothesis (2) and hypothesis (3). (Hypothesis (1) doesn’t fit well for either import or export.) Acknowledging this, the authors still chose to go forward with the idea that the number of transport molecules in the originating compartment is the most significant determining factor in the measured transport current, for both nuclear import and nuclear export, and that these numbers scale with the size of the compartment.
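This kind of hypothesis comparison can be sketched with simulated data. The snippet below is not the authors’ fitting procedure; it just simulates cells whose import current really is proportional to cytoplasmic volume (hypothesis 2) and then scores each of the three candidate predictors by the R² of a straight-line fit. It also shows why the hypotheses are hard to separate: nuclear area, nuclear volume, and cytoplasmic volume are all correlated through cell geometry, so the wrong predictors still fit fairly well.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Hypothetical per-cell geometry (arbitrary units); in the paper these come
# from microscopy, here they are simulated from a nuclear radius.
nuc_radius = rng.normal(5.0, 0.8, n)
nuc_volume = (4 / 3) * np.pi * nuc_radius**3
nuc_area = 4 * np.pi * nuc_radius**2
cyt_volume = rng.normal(4.3, 0.9, n) * nuc_volume  # cytoplasm/nucleus ratio ~4.3

# Simulate an import current that is actually proportional to cytoplasmic
# volume (hypothesis 2), plus measurement noise.
current = 0.05 * cyt_volume + rng.normal(0, 5.0, n)

def r_squared(x, y):
    """R^2 of a straight-line least-squares fit of y against x."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    return 1 - resid.var() / y.var()

for name, predictor in [("(1) nuclear surface area", nuc_area),
                        ("(2) cytoplasmic volume", cyt_volume),
                        ("(3) nuclear volume", nuc_volume)]:
    print(f"{name:26s} R^2 = {r_squared(predictor, current):.3f}")
```

In this toy version the true predictor wins, but the runners-up are not far behind — echoing the paper’s inability to distinguish hypotheses (2) and (3) for export.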

Now the question is whether you can use these measurements to propose what the distribution of import currents and export currents are likely to be across a population of cells measured by FACS. To get anywhere with this you have to do a few manipulations: for example, very dimly fluorescent cells can be seen in FACS but not using their microscopic measurements, so those cells have to be removed from the FACS data. And so on. But once these manipulations are done, the quantile-quantile plot is linear, as it should be if the two techniques are measuring the same quantity. But they can’t measure cytoplasmic volume and nuclear volume directly by FACS. Instead they had to estimate the average ratio of cytoplasm to nucleus using microscopy data. The variance in this ratio is not huge; the volume of the cytoplasm divided by the volume of the nucleus (done on a single-cell basis, then averaged) is 4.27 ± 0.11. So this is probably a reasonable approximation to use.
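Putting the pieces together, the population-level estimate amounts to: take each FACS-measured cell, split its volume into cytoplasm and nucleus using the mean ratio, and push its concentration through the fitted import law. Here is a minimal sketch under those assumptions — the saturable-law parameters and the simulated FACS distributions are illustrative, not numbers from the paper; only the 4.27 ratio is theirs.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical saturable import law, J = J_MAX * C / (K + C), scaled by
# cytoplasmic volume; J_MAX and K are illustrative, not fitted values.
J_MAX, K = 100.0, 2.0          # arbitrary units
CYT_TO_NUC = 4.27              # mean cytoplasm/nucleus volume ratio (from the paper)

# Simulated FACS readout after gating out cells too dim to see by microscopy:
# per-cell total volume (assumed already converted to microscopy units via
# the Q-Q slope) and STAT5B-GFP concentration.
total_volume = rng.lognormal(mean=7.0, sigma=0.25, size=10_000)
concentration = rng.lognormal(mean=0.5, sigma=0.6, size=10_000)

# Split total volume into cytoplasm and nucleus using the mean ratio,
# since FACS cannot resolve the two compartments directly.
cyt_volume = total_volume * CYT_TO_NUC / (1 + CYT_TO_NUC)

# Per-cell import current estimate across the whole FACS population.
import_current = J_MAX * concentration / (K + concentration) * cyt_volume

print(f"median import current: {np.median(import_current):.1f} (arbitrary units)")
```

The output is a full distribution of estimated currents, one per FACS event — which is exactly the object that can then be fed into a larger population-level pathway model.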

Using these assumptions, they are able to create a calculated estimate of the import and export currents, based on cell size and fluorescence, for the whole FACS distribution. Unfortunately, by the very nature of the work, this estimate can’t, yet, be tested. The question of how useful it is will have to wait for future studies that build on this estimate; if it helps provide insight, then the chain of reasoning described here will be in some sense validated. The value of the approach might be easier to test in a system where it’s known how the property of interest depends on cell volume/protein concentration. Ideas, anyone?

Pfeifer AC, Kaschek D, Bachmann J, Klingmüller U, & Timmer J (2010). Model-based extension of high-throughput to high-content data. *BMC Systems Biology* **4**, 106. PMID: 20687942
