Counting phosphorylations: one, two, many…

June 6, 2011 § Leave a comment

Jeremy Gunawardena’s lab just published a paper that should probably be required reading for anyone in the habit of attempting to measure the relative levels of phosphorylated proteins using Western blots (Prabakaran et al. 2011.  Comparative analysis of Erk phosphorylation suggests a mixed strategy for measuring phospho-form distributions.  Mol. Syst. Biol. 7:482).  If you are in that category, be warned: you will find this paper depressing.

What Prabakaran et al. wanted to do was to find a way of determining the pattern of phosphorylations on a protein.  They chose the simplest situation possible — Erk, a protein with just two phosphorylated sites — and set out to develop a reliable method for finding out how much of the protein was phosphorylated at only site 1, how much at only site 2, and how much on both sites.

Did you realize that with all our technology, we still can’t do this?  Many people don’t.  Quantitative mass spectrometry techniques have recently made it possible to get a number for how much of the protein is phosphorylated at site 1 or site 2, but that still doesn’t tell you the distribution of the phosphoforms.  Suppose you have a protein that looks like this:

XXXS1XXX[cleavage site]XXXS2XXX

where S1 and S2 are the sites of phosphorylation.  The [cleavage site], obviously, is the point at which the protease you’re using to chop the protein into peptides for the mass spec cuts.  When you analyze your peptides, you will have no idea whether the XXX[P]S1XXX peptides you see come from a protein in which just S1 is phosphorylated, or a protein in which both S1 and S2 are phosphorylated.  So, if you see 50% [P]S1 and 50% [P]S2, you won’t know whether this reflects a situation in which both sites are phosphorylated independently (leading to a mixed population of proteins with only S1, only S2, and both sites phosphorylated) or a situation in which S2 is only phosphorylated after S1 (50% of the protein is phosphorylated on both sites, and 50% not at all).  This could easily be biologically important, don’t you think?
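To make the ambiguity concrete, here is a minimal sketch (the numbers are invented for illustration, not data from the paper) of two completely different phosphoform distributions that give identical peptide-level measurements:

```python
# Two invented phosphoform distributions for a protein with sites S1 and S2.
# Keys: which sites are phosphorylated; values: fraction of the protein population.

# Scenario A: S1 and S2 phosphorylated independently, each with probability 0.5.
independent = {"none": 0.25, "S1 only": 0.25, "S2 only": 0.25, "both": 0.25}

# Scenario B: S2 is phosphorylated only after S1; half the molecules carry both.
ordered = {"none": 0.50, "S1 only": 0.00, "S2 only": 0.00, "both": 0.50}

def site_occupancy(dist):
    """What the peptide-level mass spec reports after cleavage between the sites:
    the fraction phosphorylated at S1 and the fraction phosphorylated at S2."""
    return dist["S1 only"] + dist["both"], dist["S2 only"] + dist["both"]

print(site_occupancy(independent))  # (0.5, 0.5)
print(site_occupancy(ordered))      # (0.5, 0.5) -- identical at the peptide level
```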

« Read the rest of this entry »

Redefining optimal

November 16, 2010 § 3 Comments

When you’re trying to use models to probe the behavior of a complex biological system, there usually comes a point where you have to “fit parameters”.  This happens because the model is trying to build up a macroscopic picture from underlying features that may be impossible to measure.  For example, in the case of tumor growth, your model might use local nutrient density as a parameter that affects the rate of the growth of individual cells in the tumor and therefore the growth of the tumor overall.  But nutrient density might not be possible to measure, and so you would have to use experimental data on something that’s easier to measure (e.g. how rapidly tumors grow) to deduce how nutrient density changes across the tumor.  This might then allow you to make a prediction of what would happen in a different set of circumstances.  A good deal of work has gone into figuring out how to estimate model parameters from experimental data, because it’s difficult; you may have to computationally explore a huge space to test which parameter values best fit your data, and you may find that your experimental data can’t distinguish among several different sets of parameters that each fit the data quite well.  A recent paper (Fernández Slezak et al. (2010) When the Optimal Is Not the Best: Parameter Estimation in Complex Biological Models. PLoS One 5: e13283. doi:10.1371/journal.pone.0013283) highlights a disturbing problem of parameter estimation: the parameters you find by searching for the optimal fit between the model and experiment may not be biologically meaningful.
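To see the basic difficulty in miniature, here is a toy sketch (a made-up exponential-decay model, nothing to do with the tumor model in the paper) in which two different parameter sets fit the same data essentially equally well, because the data only constrain a combination of the parameters:

```python
import numpy as np
from scipy.optimize import least_squares

# Synthetic "experimental" data from a toy model y(t) = exp(-a*b*t):
# only the product a*b is identifiable, so many (a, b) pairs fit equally well.
rng = np.random.default_rng(1)
t = np.linspace(0, 5, 20)
data = np.exp(-0.6 * t) + rng.normal(0, 0.01, t.size)   # true a*b = 0.6

def residuals(params):
    a, b = params
    return np.exp(-a * b * t) - data

fit1 = least_squares(residuals, x0=[1.0, 1.0])
fit2 = least_squares(residuals, x0=[6.0, 0.1])
print(fit1.x, fit1.cost)   # one "optimal" (a, b)
print(fit2.x, fit2.cost)   # a very different (a, b) with essentially the same fit
```

Real models have the same pathology, just in more dimensions and with far more expensive evaluations.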

You might find this statement (that the optimal fit is not necessarily the biologically meaningful one) self-evident, and I’ll admit I didn’t fall off my chair either.  But bear with me, because this is a more interesting study than you may think.  What the authors do is start with a previously published model of how solid tumors grow when they don’t have a blood supply.  The model recognizes the fact that solid tumors are composed of a mixture of live and dead cells, and treats the nutrients released by dead cells as potential fuel for the live cells.  The question in the original model was how far a tumor can get in this avascular mode, and what factors lead to growth or remission.  Fernández Slezak et al. aren’t interested in this question, though: they’re using the model as a test case to explore how easy it is to find parameters for which the model matches the experimental data.  This particular model has six free parameters (which is more than 4, and fewer than 30); this is manageable, though large.  I’ll mention two of the parameters, since they become important later: β, the amount of nutrient a cell consumes while undergoing mitosis; and C(c), the concentration of nutrient that maximizes the mitosis rate.  Many people would have had to settle for a rather sparse sampling of parameter space for a model of this size, but because some of the authors work at IBM they had access to remarkable computational resources (several months of Blue Gene’s compute time).
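To get a feel for why that kind of computing power matters, here is a back-of-the-envelope sketch; the grid resolution and per-run cost below are placeholders I made up, not numbers from the paper:

```python
# Back-of-the-envelope: cost of a brute-force grid scan over six parameters.
# All numbers below are invented placeholders, not values from the paper.
points_per_axis = 50            # coarse resolution along each parameter axis
n_parameters = 6                # e.g. beta, C(c), and the four others
seconds_per_run = 60            # assumed cost of one forward simulation of the model

runs = points_per_axis ** n_parameters
cpu_seconds = runs * seconds_per_run
print(f"{runs:.2e} model runs, roughly {cpu_seconds / 3.15e7:,.0f} CPU-years")
# ~1.6e10 runs and ~30,000 CPU-years: hence a massively parallel machine, not a laptop.
```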

« Read the rest of this entry »

Using models to link different varieties of data

September 3, 2010 § Leave a comment

One of the fundamental problems in systems biology is that many important decisions get made at the level of individual cells, but most of the measurement techniques we have (mass spectrometry, Western blots…) report data on the behavior of populations.  And much of the time it’s hard to relate what you see at the single-cell level to what you see at the population level.  A recent paper (Pfeifer et al. 2010.  Model-based extension of high-throughput to high-content data.  BMC Syst. Biol. 4:106.  PMID: 20687942) aims to address this issue to some extent by providing a method for comparing and merging data from microscopy and FACS analysis.
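As a toy illustration of the problem (my own sketch, not the method of Pfeifer et al.): two populations with completely different single-cell behavior can be indistinguishable in a bulk measurement.

```python
import numpy as np

rng = np.random.default_rng(0)

# Population 1: every cell responds moderately (unimodal single-cell distribution).
graded = rng.normal(loc=50, scale=5, size=10_000)

# Population 2: half the cells respond strongly, half not at all (bimodal).
all_or_none = np.concatenate([rng.normal(100, 5, 5_000), rng.normal(0, 5, 5_000)])

# A Western blot or bulk mass-spec measurement reports something like the mean:
print(graded.mean(), all_or_none.mean())     # both ~50 -- indistinguishable in bulk
# Single-cell methods (microscopy, FACS) see the difference immediately:
print(np.percentile(graded, [10, 90]))       # narrow range around 50
print(np.percentile(all_or_none, [10, 90]))  # ~0 and ~100
```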

Even if this paper hadn’t said anything else interesting, I would still want to cover it because of the first sentence of the abstract, which states:

“High-quality quantitative data is a major limitation in systems biology.”

It’s so refreshing, given the dominance of the meme that biologists are drowning in floods of data, to see an acknowledgement that often the reverse is the case: we’re starving [I guess that should be thirsting, but it doesn’t have quite the same resonance] for the right data.  This is why much of this blog is about new ways to measure things, quantitatively, so that we can get a quantitative description of how the biological circuits we aim to understand are behaving.  It’s true that there’s a lot of low-quality (by which I don’t mean carelessly done, but low information content) data about; many brave souls are addressing the problem of extracting meaningful information from these large datasets.  But in case you were in any doubt, no, this is not the time for experimentalists to down tools and retrain as computational biologists.  The ability to make quantitative measurements is still a limiting factor in understanding biological systems.

« Read the rest of this entry »

Resources roundup

August 19, 2010 § Leave a comment

Having listened to Naama Geva-Zatorsky’s seminar yesterday, I felt bad that I hadn’t been advertising the wonderful resource she helped build in her time in the Alon lab.  So I’ve added it under the list of “databases and tools” links (Dynamic Proteomics).  What you will get if you go there is a database of localization and dynamics for 1164 different genes (at the time of writing; this is, after all, a dynamic database), tagged with YFP and studied in the H1299 non-small cell lung carcinoma line.  The YFP is inserted by exon tagging, and each labeled gene is therefore under its endogenous promoter.  You can look at images showing protein localization, with quantitation of nucleus/cytoplasm levels, and movies showing protein dynamics after exposure to the DNA-damaging drug camptothecin.  It’s a remarkable resource.

And perhaps it’s not a bad idea to say a few words about what else is under there.

BioNumbers is a project Ron Milo, Paul Jorgensen and Mike Springer started while sharing a bay in the Kirschner lab.  It’s a database that collects “useful” biological numbers — how much, where, how big, how fast — with references to the literature where the number was found.  Ron Milo recently published a sampling of the data, which I wrote about here.

DataRail is an open source MATLAB toolbox for managing, transforming, visualizing, and modeling data, in particular high-throughput data.  It was developed in the Sorger and Lauffenburger labs, primarily by Julio Saez-Rodriguez and Arthur Goldsipe, with help from Jeremy Muhlich and Bjorn Millard.  I wrote a little about what it has been used for here.

GoFigure is the Megason lab’s software platform for quantitating 4D in vivo microscopy-based data in high throughput at the level of the cell; it is being developed by Arnaud Gelas, Kishore Mosaliganti, and Lydie Souhait.  There’s a snippet more about it here.

little b is an open source language for building models that allows the re-use and modification of shared parts.  It also provides custom notations that make models easier to read and write. It was developed in the Gunawardena lab by Aneil Mallavarapu.

MitoCarta is an inventory of 1098 mouse genes encoding proteins with strong support for mitochondrial localization.  The Mootha lab performed mass spectrometry of mitochondria isolated from fourteen tissues, assessed protein localization through large-scale GFP tagging/microscopy, and integrated these results with six other genome-scale datasets of mitochondrial localization.  You can search human and mouse datasets, and view images of 131 GFP-tagged proteins with mitochondrial localization.

Rule-based modeling is an approach to building executable models of protein interaction networks by writing general rules about how proteins interact, rather than enumerating every molecular species by hand.  It’s based on the kappa language, originally written by Jérôme Feret and Jean Krivine, working with Walter Fontana.

Do you know about tools that were developed to help understand biological systems at the cell/organelle/pathway level?  Send me an e-mail at becky[at]hms.harvard.edu and I’ll link it.  Thanks!

Friday Feature: Ceci n’est pas un film

June 18, 2010 § Leave a comment

It’s Friday — but this is not a movie.  One must be fair to the non-imagers among us (theorists, biochemists and such).  You are looking at a representation of the geometry of the steady-state phosphoform distribution in a system consisting of two enzymes (a kinase and a phosphatase) and a substrate with two distinct phosphorylation sites, from the work described in Manrai AK and Gunawardena J.  2008.  The geometry of multisite phosphorylation.  Biophys. J. 95:5533-5543.  PMC2599844.

Why would you want to know about this?  Post-translational regulation is a hugely important mechanism for changing the behavior or localization of proteins, and phosphorylation is possibly the most important form of post-translational regulation in eukaryotes.  The number of phosphorylation sites on some of the proteins we study is staggering: the EGF receptor has 10, p53 has 16, and tau (the microtubule-associated protein in the neurofibrillary tangles in Alzheimer’s disease) has over 40.  Since a protein with n phosphorylation sites has 2^n possible ways of being phosphorylated, the presence of different phosphoforms adds enormously to the complexity of the mixtures of proteins found inside a cell.
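Just to spell out the combinatorics (a trivial sketch using the numbers quoted above):

```python
from itertools import product

# Each of n phosphorylation sites is either occupied (1) or empty (0),
# so a protein with n sites has 2**n distinct phosphoforms.
def phosphoforms(n_sites):
    return list(product((0, 1), repeat=n_sites))

print(len(phosphoforms(2)))  # 4: Erk's phosphoforms (none, S1 only, S2 only, both)
print(2 ** 16)               # 65536 possible phosphoforms for p53
print(2 ** 40)               # ~1.1e12 for tau
```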

How does a biological system interpret this complexity?  p53 has an impressive variety of biological functions, but it’s hard to believe that the 2^16 different phosphoforms of p53 (>65,000) each have specific biological activities.  It seems much more likely that cells use some kind of readout of the overall distribution of phosphorylations — possibly it’s the concentration of proteins with phosphorylations above a certain level that matters, or maybe it’s something more complicated than that.  It’s hard to know without the tools to analyze phosphoform distributions.
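For concreteness, here is what one such readout might look like; the phosphoform distribution below is randomly invented, and the threshold readout is just one plausible choice, not anything proposed in the paper:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
sites = 16
forms = list(product((0, 1), repeat=sites))        # all 65,536 phosphoforms of p53
abundance = rng.dirichlet(np.ones(len(forms)))     # an invented phosphoform distribution

# One possible coarse readout: the total abundance of molecules carrying at
# least k phosphates, regardless of which particular sites are occupied.
def fraction_with_at_least(k):
    return sum(a for form, a in zip(forms, abundance) if sum(form) >= k)

print(fraction_with_at_least(8))   # e.g. fraction of p53 with 8 or more phosphorylations
```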

Enter Manrai and Gunawardena.  I will tread very lightly over the ground they cover: I can’t reproduce the mathematical reasoning in this format (especially since I’ve just discovered that I can’t even do superscripts in this particular WordPress style), but the entire Mathematica notebook with the proof is available if you want it.  This is what they show:

« Read the rest of this entry »

Dream on…

June 14, 2010 § Leave a comment

The third and fourth DREAM5 challenges are now out.  This earlier post gives a quick summary of the DREAM (Dialogue on Reverse Engineering Assessment and Methods) project.  The challenges are listed below the fold.

« Read the rest of this entry »

Good luck, Julio!

June 14, 2010 § 1 Comment

This month, we say a tearful goodbye to Julio Saez-Rodriguez.  Julio is setting up his own group at the European Bioinformatics Institute (EBI) in Hinxton, near Cambridge UK, starting on July 1.  EBI is a great fit for him, with its focus on developing computational tools to serve the whole biological community.  There’s an added bonus in that he will be about 3,000 miles closer to his home town in Spain.

Julio came to the Department when the Sorger lab joined us from MIT.  He’s been a major driver of the Sorger lab’s efforts to develop logical models that represent real cellular pathways.  Much of the time, our view of a biological pathway is a kind of average, created by combining the lists of interactions identified in dozens or hundreds of cell lines.  But the pathway doesn’t have to be the same in every cell type — the differences in behavior between cells have to come from somewhere, and by now we know that there are not enough genes in the human genome to allow each cell type to have whole pathways that are different from other cell types.  When Julio looked at liver cells, for example (with Leo Alexopoulos), he started with a consensus model, assembled from the literature, of the pathways controlling the response to seven different cytokines.  Using an extensive dataset of signal/response measurements taken exclusively from HepG2 cells, he asked how well the literature model could predict actual cell behavior.  What he found was that in order to make the consensus pathway work, he had to drop several “proven” interactions in the network — and add several that had indeed been observed in one cell type or another, but were not sufficiently consistently observed to get into the consensus model.
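For readers who haven't met logical models, here is a toy fragment in that general spirit; it is entirely invented (two cytokine inputs, three nodes) and is not Julio's HepG2 network. Each node is simply ON or OFF, and "training" the model against data amounts to adding or deleting rules like these until predictions match measurements.

```python
# A toy logical ("Boolean") signaling model: two cytokine inputs, three nodes.
# Entirely invented for illustration.
def network(tnfa: bool, il6: bool) -> dict:
    nfkb = tnfa                      # hypothetical rule: TNFa -> ... -> NF-kB
    stat3 = il6                      # hypothetical rule: IL-6 -> JAK -> STAT3
    survival = nfkb or stat3         # OR gate: either pathway turns the readout on
    return {"NF-kB": nfkb, "STAT3": stat3, "survival": survival}

# Enumerate all input combinations and compare against measured responses.
for tnfa in (False, True):
    for il6 in (False, True):
        print(tnfa, il6, network(tnfa, il6))
```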

One day, the signaling community will realize that Julio has done them a huge favor.  Instead of fighting over who’s right and who’s wrong, everyone can be right — as long as you’re studying different cell types, or (if you get desperate) different variants of the same cell types.  Studying one cell type instead of another is like entering an alternate universe: anything can happen somewhere.

Julio plans to continue to work on developing models to help us understand the logic, and the cell specificity, of signaling pathways.   He’s hiring post-docs, and he’s also interested in collaborators and short-term visitors; you still have a week or so to chat to him about the possibilities.  (The UK is warmer than Boston in the winter, although it has to be said that it is also grey and drizzly.)

Good luck, Julio, and keep in touch!
