Redefining optimal
November 16, 2010
When you’re trying to use models to probe the behavior of a complex biological system, there usually comes a point where you have to “fit parameters”. This happens because the model is trying to build up a macroscopic picture from underlying features that may be impossible to measure. For example, in the case of tumor growth, your model might use local nutrient density as a parameter that affects the growth rate of individual cells in the tumor, and therefore the growth of the tumor overall. But nutrient density might not be possible to measure, and so you would have to use experimental data on something that’s easier to measure (e.g. how rapidly tumors grow) to deduce how nutrient density changes across the tumor. This might then allow you to make a prediction of what would happen under a different set of circumstances. A good deal of work has gone into figuring out how to estimate model parameters from experimental data, because it’s difficult: you may have to computationally explore a huge space to test which parameter values best fit your data, and you may find that your experimental data can’t distinguish among several different sets of parameters that each fit the data quite well. A recent paper (Fernández Slezak et al. (2010) When the Optimal Is Not the Best: Parameter Estimation in Complex Biological Models. PLoS ONE 5: e13283. doi:10.1371/journal.pone.0013283) highlights a disturbing problem with parameter estimation: the parameters you find by searching for the optimal fit between model and experiment may not be biologically meaningful.
You might find this statement self-evident, and I’ll admit I didn’t fall off my chair either. But bear with me, because this is a more interesting study than you may think. What the authors do is start with a model, built by others, of how solid tumors grow when they don’t have a blood supply. The model recognizes the fact that solid tumors are composed of a mixture of live and dead cells, and treats the nutrients released by dead cells as potential fuel for the live cells. The question in the original model was how far a tumor can get in this avascular mode, and what factors lead to growth or remission. Fernández Slezak et al. aren’t interested in this question, though: they’re using the model as a test case to explore how easy it is to find parameters that make the model match the experimental data. This particular model has six free parameters (which is more than 4, and fewer than 30); this is manageable, though large. I’ll mention two of the parameters, since they become important later: β, which is the amount of nutrient a cell consumes while undergoing mitosis; and C(c), the concentration of nutrient that maximizes the mitosis rate. Many people would have had to settle for a rather sparse sampling of parameter space for a model of this size, but because some of the authors work at IBM they had access to remarkable computational resources (several months of Blue Gene’s compute time).
Fernández Slezak et al. used a set of synthetic growth curves extrapolated from experimental measurements made by others as their test data set, and set out to find parameters that fit the “experimental” data by brute-force computation. Essentially, this means guessing values for each of the six parameters, putting those values into the model, simulating the growth of a tumor, and comparing the predicted growth curve to (in this case synthetic) experimental reality. The optimal parameter set is then the one that minimizes the “cost function” of the sum of the differences between prediction and reality across all time points. (For the aficionados, they used four different methods for minimization of the cost function: Levenberg-Marquardt, parallel tempering, MIGRAD and downhill simplex. Of these, the Levenberg-Marquardt method did best at finding cost function minima and downhill simplex did worst.)
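To make that loop concrete, here’s a minimal sketch in Python. The toy logistic growth curve and its three parameters are stand-ins of my own invention rather than the actual six-parameter tumor model, and scipy’s least_squares with method="lm" plays the part of the Levenberg-Marquardt minimizer mentioned above.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)

def predicted_growth(params, t):
    """Toy stand-in for the tumor model: a logistic growth curve."""
    size_max, rate, t_half = params
    return size_max / (1.0 + np.exp(-rate * (t - t_half)))

# Synthetic "experimental" growth curve (a stand-in for the real data).
t_obs = np.linspace(0, 30, 31)
true_params = (100.0, 0.4, 15.0)
y_obs = predicted_growth(true_params, t_obs) + rng.normal(0, 2, t_obs.size)

def residuals(params):
    # Per-time-point differences between prediction and "reality";
    # least_squares minimizes the sum of their squares (the cost function).
    return predicted_growth(params, t_obs) - y_obs

guess = np.array([80.0, 0.2, 10.0])                  # one guessed parameter set
fit = least_squares(residuals, guess, method="lm")   # Levenberg-Marquardt
print("best-fit parameters:", fit.x, "cost:", fit.cost)
```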
With a relatively dense sampling of parameter space, the authors were able to see just how rugged this cost function landscape is. Parameter values that are right next to each other can be dramatically different in how well they fit the data. There are lots of local minima, and no completely clear winners: each minimization method found several “best fits” that were essentially equal in terms of the cost function, each of which described the experimental data very well. But the parameter values for these “best fits” were not close together.
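One way to see that ruggedness, continuing the toy sketch above (and only as an illustration, not the authors’ exact protocol), is to restart the minimizer from many random guesses and keep every fit whose cost is essentially as good as the best one. On a rugged landscape, those near-equal “best fits” can sit at quite different parameter values.

```python
# Multi-start minimization: collect all fits within 1% of the best cost.
fits = []
for _ in range(200):
    start = rng.uniform([10.0, 0.01, 1.0], [200.0, 1.0, 30.0])
    result = least_squares(residuals, start, method="lm")
    fits.append((result.cost, result.x))

fits = [(c, p) for c, p in fits if np.isfinite(c)]   # drop any failed runs
best_cost = min(cost for cost, _ in fits)
near_best = [params for cost, params in fits if cost <= 1.01 * best_cost]
print(len(near_best), "parameter sets describe the data essentially equally well")
print("spread of those parameter sets:", np.ptp(near_best, axis=0))
```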
This is where β and C(c) come in. These particular parameters have been measured independently in the literature, although the authors scrupulously ignored this in their parameter-fitting exercise. So now it’s possible to compare the “best fits” from the parameter-fitting to actual experimental ranges for two of the six parameters. And it turns out that only a small subset of the values found as “best fits” falls within the ranges set by the experimental values, and (embarrassingly, perhaps) the very best “best fit” is not one of them.
What does this mean for the future of parameter-fitting? Fernández Slezak et al. remind us that fitting existing data is not the point: the point is to gain insight into how real biology may behave in circumstances that have not been, or cannot be, directly tested. This is not too different from the goal in machine learning, where you want to use a training set to teach you something about how to approach new data. But in machine learning it’s well known that an algorithm that shows good performance on the training set can show very poor predictive ability; this happens when the model is “overfitted” to the data.
How are we going to avoid the problem of overfitting in biological models? The authors propose that one way to approach this is to use a subset of the available experimental data for the first round of parameter-fitting, and reserve the rest of the data to test the model. This is similar to the cross-validation techniques used in statistical learning, or the R(free) test used by structural biologists. This approach probably won’t work very well if we simply pick the “best fit” from the first round of parameter-fitting and ask whether it fits the new data, however; instead, we’ll need a way to define a set of “reasonable fits”. Instead of minimizing the cost function, we may only need to get it below a certain level. Then the question will be, which set of parameters is “reasonable” for all the data sets available?
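Here’s a rough sketch of what that might look like, again using the toy model from the sketches above rather than the real one; the 20% tolerance that defines a “reasonable fit” is an arbitrary choice, made purely for illustration.

```python
# Hold out the late time points, fit only the early ones, keep every
# parameter set whose training cost is within 20% of the best training cost
# (a set of "reasonable fits" rather than a single winner), then ask which
# of them still describe the held-out points to the same per-point tolerance.
train = t_obs < 20
test = ~train

def sse(params, mask):
    r = predicted_growth(params, t_obs[mask]) - y_obs[mask]
    return np.sum(r ** 2)

def train_residuals(params):
    return predicted_growth(params, t_obs[train]) - y_obs[train]

train_fits = []
for _ in range(100):
    start = rng.uniform([10.0, 0.01, 1.0], [200.0, 1.0, 30.0])
    train_fits.append(least_squares(train_residuals, start, method="lm").x)

cutoff = 1.2 * min(sse(p, train) for p in train_fits)   # arbitrary 20% tolerance
reasonable = [p for p in train_fits if sse(p, train) <= cutoff]
per_point = cutoff / train.sum()                        # same tolerance per time point
survivors = [p for p in reasonable if sse(p, test) / test.sum() <= per_point]
print(len(survivors), "of", len(reasonable), "reasonable fits also describe the held-out data")
```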
Since the model itself is undoubtedly not a perfect representation of reality, it may be pointless to look for a perfect fit between model and experiment. As a wise person once (almost) said, it’s important not to let the perfect be the enemy of the good.
Fernández Slezak D, Suárez C, Cecchi GA, Marshall G, & Stolovitzky G (2010). When the optimal is not the best: parameter estimation in complex biological models. PLoS ONE 5(10): e13283. doi:10.1371/journal.pone.0013283. PMID: 21049094
I recently became aware of a fascinating approach to minimizing the number of free parameters used in models. It is known by several names (each one denoting something a bit different): Lasso method, compressed sensing, L1 minimization etc.
I was introduced to it by Avi Flamholz, a student who recently joined the lab after working at Google, and it turns out to be very useful in many applications, from sequencing and microarray analysis to image processing. The ideas behind it are well explained in the following video lectures by Emmanuel Candes from Caltech (http://videolectures.net/mlss09us_candes_ocsssrl1m/).
One intuitive ingredient is to impose a penalty for using many parameters and to exploit the information that the model is sparse, i.e. that not all possible variables have a significant effect. But there is much more to it, and its analytic strength seems almost magical.
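To make the sparsity idea concrete, here is a small example of the L1 penalty at work, using scikit-learn’s Lasso on synthetic data of my own invention: with far more candidate variables than samples, the penalty drives the coefficients of the irrelevant variables to exactly zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n_samples, n_features = 50, 200                  # many more variables than samples
X = rng.normal(size=(n_samples, n_features))

true_coef = np.zeros(n_features)
true_coef[:5] = [3.0, -2.0, 1.5, 4.0, -1.0]      # only 5 variables actually matter
y = X @ true_coef + rng.normal(scale=0.5, size=n_samples)

# alpha sets the strength of the L1 penalty; larger alpha means fewer
# non-zero coefficients (a sparser, more parsimonious model).
model = Lasso(alpha=0.1).fit(X, y)
print("non-zero coefficients:", np.count_nonzero(model.coef_), "out of", n_features)
```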
Below is my response to Ron from when we spoke about this paper over email. Thought it was appropriate for sharing:
The general problem is that real-world data is noisy, variables are often correlated, and computational models make simplifying, partially true assumptions. Minimizing model error alone (even prediction error) can only optimize a flawed model for flawed data. In a broad sense, the way to address this problem is to impose some realism on your model. The paper (as I skimmed it) suggests doing so by optimizing the model less and manually analyzing a larger swath of parameter-space, but you can also do things in the other order: analyze more up-front and modify the model to better accord with reality.
Regularization is the idea of imposing a penalty function on top of a fit function. The penalty, presumably, helps the model fit some understanding of “reality.” The Bayesian approach of imposing a prior distribution on the model-building process is a slightly more flexible way of achieving the same goal, if a bit more complicated. L1 norm minimization (which Ron mentioned) is a type of regularization that is particularly relevant to biological analysis because it drives towards parsimony, but it is not always the most appropriate method.
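As a generic illustration (not the paper’s method, and with synthetic straight-line data and parameter names of my own choosing), here is what imposing a penalty function on top of a fit function looks like in practice: the minimizer now trades goodness of fit against how far the parameters stray from a prior guess.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
t = np.linspace(0, 10, 50)
y = 2.0 * t + 1.0 + rng.normal(scale=1.0, size=t.size)        # synthetic data

def penalized_cost(params, lam=1.0, prior=np.array([0.0, 0.0])):
    slope, intercept = params
    fit_term = np.sum((slope * t + intercept - y) ** 2)        # ordinary least-squares term
    # Quadratic (L2) penalty pulling the parameters toward the prior guess;
    # swapping in np.abs gives an L1 penalty, which pushes uninformative
    # parameters all the way to zero instead.
    penalty = lam * np.sum((np.asarray(params) - prior) ** 2)
    return fit_term + penalty

result = minimize(penalized_cost, x0=[0.0, 0.0], method="Nelder-Mead")
print("regularized fit (slope, intercept):", result.x)
```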
No matter how you inject “reality” into your model, you’ve got to figure out what you think “reality” means. This means making a judgment about a physical/chemical/biological system, which computationalists are sometimes wary of (myself included). This, I think, is the biggest gap of all.
There’s really no way around “injecting” reality into the model. A number of results in computational learning theory collectively known as “No Free Lunch” theorems basically say that, across all possible worlds, all learners are equal. What makes some learners stand out (in our world) is that they make assumptions which hold true. It doesn’t have to be subjective. There are several universal laws (averaging laws, scaling laws) that have been empirically confirmed, so you could bias your learner to incorporate such assumptions. In practice it could mean doing things like preferring the highest-entropy distribution out of all distributions that fit the data equally well. Here’s Terry Tao’s article on universal laws: http://terrytao.wordpress.com/2010/09/14/a-second-draft-of-a-non-technical-article-on-universality/
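As a toy illustration of that maximum-entropy preference (my own example, not from Tao’s article): among several candidate distributions that reproduce the observed mean equally well, prefer the one with the highest Shannon entropy, i.e. the one that commits to the least beyond what the data demand.

```python
import numpy as np
from scipy.stats import entropy

support = np.arange(1, 7)                        # outcomes of a six-sided die
observed_mean = 3.5                              # the only thing the "data" tell us

candidates = {
    "uniform":   np.full(6, 1 / 6),
    "peaked":    np.array([0.1, 0.1, 0.3, 0.3, 0.1, 0.1]),
    "two-spike": np.array([0.0, 0.5, 0.0, 0.0, 0.5, 0.0]),
}

# Keep only the candidates that fit the data (the observed mean) equally well,
# then prefer the one with the highest entropy.
consistent = {name: p for name, p in candidates.items()
              if np.isclose(p @ support, observed_mean)}
best = max(consistent, key=lambda name: entropy(consistent[name]))
print("maximum-entropy choice:", best)           # the uniform distribution wins
```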