## The laws of averages

February 7, 2012 § 3 Comments

The Hitchhiker’s Guide to the Galaxy, that truly re*mark*able book, points out that since the area of the universe is infinite and the number of populated worlds is finite, the population of the universe is, on average, none. So although you might see people from time to time, they are most likely merely products of your imagination. Arguing from averages is always tricky; many people in the department are fixated on the question of what happens when the average is not a good surrogate for what’s happening to the individual, as for example when there are two populations behaving in distinct ways and the average captures neither behavior. But a recent paper argues that there is quite a lot you can deduce about the physical limits to cell behavior by knowing the average behavior of the proteins that make up the cell (Dill, Ghosh and Schmit 2011. Physical limits of cells and proteomes. *PNAS* doi:10.1073/pnas.1114477108). Actually the average alone is not enough: you need to know the distribution around the average as well.

The argument goes like this. Because the mass of a cell is (on average, and excluding water) about 50% protein, the physical properties of the mixture of proteins that make up the proteome are likely to be important in dictating the physical properties of the cell itself. You might think this is a rather unhelpful idea: if you need to measure the properties of individual proteins one by one and average them all together to determine the overall behavior of the proteome, then it may be easier to measure the physical properties of the cell directly. But it turns out that many physical properties of proteins depend strongly on their length. For example, the free energy of folding of a protein is directly correlated to the number of amino acids it’s made up of (let us, creatively, call this number N). While the details of the structure of the protein — secondary structure, the number of hydrophobic amino acids, the number of salt bridges, etc. — may be important for individual proteins, on average these details appear to have only a minor effect. This means that you can, in principle at least, figure out quite a lot about how a cell’s proteome responds to heat by simply knowing the relationship between N and folding free energy, and the average and distribution of N. Which, in principle, you can get from genomic information. Similarly, if you assume that proteins are in general globular, then the overall size of a protein depends fairly straightforwardly on N. That means that the rate of diffusion of a protein also depends on N. And if you know the distribution of N for a cell’s proteome, and the size of the cell, you also know something about the density of the intracellular environment.

So Dill et al. are suggesting, among other things, that you should be able to use sequence databases to predict the response of different cells to heat shock. They go further than simply suggesting that it should be possible: they set out to do it. First, they needed to figure out the relationship between N and the free energy of folding, ΔG. Since the free energy of folding of a given protein must be dependent on temperature, T, they use T as a variable as well. They use literature measurements of ΔG for 116 proteins to create two different approximations for the ΔG/N/T relationship, one for proteins from mesophiles (those of us who like to live at moderate temperatures) and the other for proteins from thermophiles (those who like to think they’re hot, and live at 45°C or above).

Having done this, all we need to know is N to be able to determine ΔG for any given temperature. Using the mean and variance of protein chain lengths in the organism’s proteome, predicted from genome sequence information, you can get an approximation for this too. By putting the two equations together (mesophile ΔG/N equation with N distributions from mesophiles, and thermophile ΔG/N equation with N distributions from thermophiles, naturally), Dill et al. can then produce an estimate for the distribution of stability of proteins in a given proteome.
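To make the "putting the two equations together" step concrete, here is a minimal Python sketch. It is not the authors' code: the gamma-shaped length distribution, the straight-line `folding_free_energy` function, and every number in it (`mean_n`, `var_n`, `slope`, `intercept`) are placeholder assumptions, chosen only to show the mechanics of pushing a distribution of N through a ΔG(N) relationship.

```python
import random
from statistics import mean


def sample_chain_lengths(mean_n, var_n, count, seed=0):
    """Draw chain lengths N from a gamma distribution (an assumed shape)
    whose mean and variance match the supplied values."""
    shape = mean_n ** 2 / var_n   # gamma shape parameter
    scale = var_n / mean_n        # gamma scale parameter
    rng = random.Random(seed)     # seeded so the sketch is repeatable
    return [rng.gammavariate(shape, scale) for _ in range(count)]


def folding_free_energy(n, slope, intercept):
    """Placeholder linear model: dG = slope * N + intercept, in kcal/mol."""
    return slope * n + intercept


def fraction_below(values, threshold):
    """Fraction of values that fall strictly below the threshold."""
    return sum(v < threshold for v in values) / len(values)


# Hypothetical inputs, NOT taken from the paper.
lengths = sample_chain_lengths(mean_n=300.0, var_n=200.0 ** 2, count=20000)
stabilities = [folding_free_energy(n, slope=0.02, intercept=1.0) for n in lengths]

print(f"mean chain length: {mean(lengths):.0f}")
print(f"mean stability:    {mean(stabilities):.2f} kcal/mol")
print(f"fraction below 3 kcal/mol: {fraction_below(stabilities, 3.0):.2f}")
```

Swapping in a real ΔG/N/T fit and a real genome-derived length distribution would change the numbers but not the structure: sample N, map each N to a stability, then read off the tail of marginally stable proteins.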

This is already interesting because the thermophile protein stability equation is different from the mesophile equation, so ΔG depends not only on N but also on the class of organism. And Dill et al. note that it isn’t entirely clear where the difference comes from. Nevertheless, within each class of organisms there seems to be a reasonable linear relationship between ΔG and N. So let’s just assume that all mesophile proteins behave the same way as each other, and take a look at a plot of the number of proteins versus stability in the proteome of the biologist’s favorite organism, E. coli, at 37°C. It has a pronounced skew and looks like this:

[Figure not reproduced: predicted number of E. coli proteins versus folding stability at 37°C.]

What this shows is that although the average protein is predicted to be fairly stable at 37°C (with a free energy of folding of about 6.8 kcal/mol), there are a few hundred proteins that are predicted to be only marginally stable (free energy of folding < 3 kcal/mol). So for E. coli, even a small change in temperature — say 4°C — would be predicted to destabilize about 16% of the proteins in the proteome. Which would be bad; misfolded proteins are a problem, as we’ve discussed before. But just how bad would it be?

One way to get at the effect of a heat shock of a certain size is to look at the effect of the temperature increase on essential proteins. If the heat shock is large enough to kill the function of even one essential protein completely, then, by definition, the organism is dead too. The authors were able to use the analysis above to work out an equation representing the probability that a given essential protein is denatured.
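One simple way to picture that probability is the textbook two-state folding equilibrium, in which the folded fraction of a protein with stability ΔG at temperature T is 1/(1 + exp(-ΔG/RT)). The Python sketch below uses that formula and then multiplies the folded fractions of a set of essential proteins together, which assumes they unfold independently. The function names and the example stabilities are mine, for illustration only; the equation in the paper may differ in detail.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)


def fraction_folded(dg_kcal, temp_k):
    """Two-state equilibrium: folded fraction for a protein whose folded
    state is more stable than the unfolded state by dg_kcal (kcal/mol)."""
    return 1.0 / (1.0 + math.exp(-dg_kcal / (R_KCAL * temp_k)))


def prob_all_folded(dgs_kcal, temp_k):
    """Probability that every listed protein is folded at once,
    assuming the proteins unfold independently (a simplification)."""
    return math.prod(fraction_folded(dg, temp_k) for dg in dgs_kcal)


# Hypothetical stabilities (kcal/mol) for three essential proteins at 37 °C.
example_stabilities = [6.8, 3.0, 5.0]
body_temp_k = 310.15

for dg in example_stabilities:
    print(f"dG = {dg} kcal/mol -> folded fraction {fraction_folded(dg, body_temp_k):.5f}")
print(f"all three folded: {prob_all_folded(example_stabilities, body_temp_k):.5f}")
```

Because the product runs over every essential protein, the least stable few dominate the result, which is why the marginally stable tail of the distribution matters so much.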

Can we test these predictions? We can, but we need to take the analysis one step further. We want to know whether the “death of essential proteins” prediction explains the heat sensitivity of an organism, and there is a confounding factor here: as temperatures get higher, enzymatic reactions become faster. So cells should grow faster with increasing temperature, until this increased growth rate is undermined by the fact that essential proteins start to become unstable. Dill et al. combined an equation describing the assumption that hotter enzymes run faster with the equation describing the death of essential proteins, and got a single function that should describe the overall growth behavior of a cell as it relates to temperature. This equation can then be fit for two parameters, the enthalpy of folding (ΔH) and the number of essential proteins in the proteome, Γ.
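The shape of such a combined function can be sketched in a few lines of Python. This is an illustration under loud assumptions, not the equation from the paper: the speed-up term is a plain Arrhenius factor with a made-up activation energy, every essential protein is given the same stability ΔG(T) = ΔH(1 - T/Tm) with no heat-capacity term, and the values of `activation_kcal`, `dh_kcal`, `tm_k` and `n_essential` are hypothetical.

```python
import math

R_KCAL = 1.987e-3  # gas constant, kcal/(mol*K)


def fraction_folded(temp_k, dh_kcal, tm_k):
    """Two-state folded fraction with dG(T) = dH * (1 - T/Tm) (assumed form)."""
    dg = dh_kcal * (1.0 - temp_k / tm_k)
    return 1.0 / (1.0 + math.exp(-dg / (R_KCAL * temp_k)))


def relative_growth_rate(temp_k, activation_kcal, dh_kcal, tm_k, n_essential):
    """Arrhenius speed-up multiplied by the chance that all essential
    proteins (assumed identical and independent) are folded."""
    speed_up = math.exp(-activation_kcal / (R_KCAL * temp_k))
    all_folded = fraction_folded(temp_k, dh_kcal, tm_k) ** n_essential
    return speed_up * all_folded


# Hypothetical parameters chosen only to show the rise-then-collapse shape.
params = dict(activation_kcal=15.0, dh_kcal=100.0, tm_k=325.0, n_essential=100)
temps = list(range(280, 331))  # kelvin, in 1 K steps
rates = [relative_growth_rate(t, **params) for t in temps]
best_temp = temps[rates.index(max(rates))]

print(f"fastest growth on this grid at {best_temp} K")
```

The curve rises gently on the cold side, where only the Arrhenius term matters, and collapses steeply on the hot side, because the folded fraction is raised to the power of the number of essential proteins.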

Now we can look at whether the predictions match reality, and to my eye the match is rather startlingly good. Dill et al. predict growth rates for 6 mesophiles and 6 thermophiles across about 30 degrees of temperature, and compare their predictions to existing information in the literature about actual growth rates. Their predictions of the optimal temperature for growth look to me to be close to perfect, and even their predictions of the shape of the growth rate/temperature curves are pretty good. This is quite impressive given the long chain of reasoning and data-fitting that got us here, especially given that the authors are making no effort to worry about which proteins are abundant and which might not be expressed at all under the conditions of the experiment.

As an encore, Dill et al. go on to provide ways of estimating the internal viscosity of a cell based on the fact that the average protein has a radius that scales with N to the power of 2/5 (no, not 1/3, as you’d expect if they were spherical… we don’t know why, but that’s what the data say). Given this, and all the analysis of N distribution above, you can approximate diffusion rate. Dill et al. point out that the cell may face a tradeoff in terms of protein density: high density is bad, because it makes diffusion slower, but low density is bad because it means that proteins that have to interact with each other take longer to find each other. The optimum protein density for maximizing reaction speed that pops out of their analysis, 0.19, is very close to the measured number (0.2).
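The diffusion step follows from the Stokes-Einstein relation, D = k_BT/(6πηR): if the radius R grows as N^(2/5), then at fixed temperature and viscosity D falls as N^(-2/5). The short Python sketch below only compares proteins relative to a reference chain length, because the prefactors in both relations are not given here. `relative_radius` and `relative_diffusion` are illustrative names, not anything from the paper, and the sketch ignores the crowding effects that the density tradeoff is about.

```python
RADIUS_EXPONENT = 2.0 / 5.0  # radius ~ N**(2/5), as quoted above


def relative_radius(n, n_ref):
    """Radius of an N-residue protein relative to an n_ref-residue one."""
    return (n / n_ref) ** RADIUS_EXPONENT


def relative_diffusion(n, n_ref):
    """Stokes-Einstein gives D ~ 1/radius at fixed temperature and viscosity,
    so the diffusion ratio is the inverse of the radius ratio."""
    return 1.0 / relative_radius(n, n_ref)


# Compare a few chain lengths against a 100-residue reference protein.
for n in (100, 300, 1000):
    print(n, round(relative_diffusion(n, 100), 2))
```

With this exponent a chain 32 times longer than the reference has four times the radius, so it diffuses at a quarter of the speed.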

That’s not all: there are further interesting nuggets in this paper, including an estimate of how the rate of protein diffusion may affect the optimum size of cells, and the suggestion that the speed of protein folding may limit the maximum growth rate of E. coli. Although these calculations are not exactly back-of-the-envelope, the way this paper uses estimations of physical properties to illuminate important biological concepts reminds me of Richard Feynman’s famous tendency to work out concepts with back-of-the-envelope sketches. It’s nice to think that we may finally be at the point where we know enough about the basic principles and parameters of biological systems to be able to use reasoning of this kind without going horribly wrong.

Wow. That’s really cool.

Isn’t it? This is the kind of thing that the BioNumbers database was set up to encourage. Hopefully we’ll see more of it.

Thank you, Becky, for introducing me to the BioNumbers database. I did not know it existed. It looks very useful.