December 9, 2011
The protein folding problem, as it’s called, has been confounding biologists for decades. Unlike a strand of RNA or DNA, which can be relied upon to follow a few rather simple rules dominated by base pairing, a string of amino acids seems to have so many possible ways to interact with itself as to defy analysis. But the “problem” isn’t a problem for the protein — proteins fold, for the most part, rather efficiently, even in vitro where they have no help from the rest of the contents of the cell. So the information on what the fold should look like is sitting there in the sequence, and the question is how to read the code.
The protein folding field has tried many routes to translate the information in protein sequence into three-dimensional structure, ranging from head-on attempts to use physics plus supercomputers to work out a protein fold from first principles, to efforts to harness the vast problem-solving potential of online gamers by translating the rules of folding into a game, to attempts to “cheat” by using evolutionary conservation to get an idea of which parts of the sequence are essential for the shape of the fold. There’s been progress on all fronts, and it’s now possible to use computational methods of various kinds, often informed by existing structural information, to fold small proteins. Although you need a fair amount of computer power to be successful, and better models for the complicated forces affecting a biomolecule in water are clearly needed, there’s a feeling that if we keep chewing away at the problem we will eventually be able to solve it.
A new paper (Marks et al. 2011, Protein 3D structure computed from evolutionary sequence variation, PLoS One doi:10.1371/journal.pone.0028766) now provides a rather startling step forward that dramatically reduces the need for major computational resources. You can now fold a ~250 amino acid protein on your ordinary laptop. The one apparent catch is that you can’t do this with just any sequence: you need a fairly large family of homologous sequences, of around 1,000 family members. Information derived from this family of sequences about changes in one part of the protein that correlate with changes in another part of the protein (covariance) is used to infer how close the two parts of the protein are to each other. This reduces the “conformational search space”, the number of three-dimensional folds you have to evaluate before settling on the best one, and that in turn not only speeds up the process of sorting through the possibilities, but also increases the chance that you will find the right answer.
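To make the covariation idea concrete, here is a minimal sketch in Python. It assumes plain mutual information between pairs of alignment columns as the score, which is a deliberately simple stand-in and not the scoring implemented in the paper. The function names and the six-sequence toy alignment are invented for illustration; a real analysis would start from an alignment of roughly a thousand homologous sequences.

```python
import math
from collections import Counter
from itertools import combinations


def column(alignment, i):
    """Return the residues found at position i across all aligned sequences."""
    return [seq[i] for seq in alignment]


def mutual_information(col_a, col_b):
    """Mutual information (in nats) between two alignment columns.

    High values mean that the residue at one position helps predict the
    residue at the other, i.e. the two positions vary together.
    """
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        # p(a,b) * log( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (n_ab / n) * math.log(n_ab * n / (count_a[a] * count_b[b]))
    return mi


def rank_column_pairs(alignment):
    """Score every pair of positions and sort from most to least covarying."""
    length = len(alignment[0])
    cols = [column(alignment, i) for i in range(length)]
    scores = [
        ((i, j), mutual_information(cols[i], cols[j]))
        for i, j in combinations(range(length), 2)
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical toy alignment: positions 0, 1 and 3 change together,
# while position 2 varies independently of the others.
toy_alignment = [
    "AKLE",
    "AKLE",
    "DRLK",
    "DRIK",
    "AKIE",
    "DRLK",
]

if __name__ == "__main__":
    for (i, j), score in rank_column_pairs(toy_alignment):
        print(f"positions {i} and {j}: MI = {score:.3f}")
```

In this toy case the pairs drawn from positions 0, 1 and 3 come out on top and every pair involving position 2 scores zero. The top-ranked pairs are the candidates for being close together in the fold, and treating them as distance constraints is what shrinks the search space.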
Now, this is far from a new idea. It has been tried and tried and tried for years and has always failed. In fact one of the authors (Chris Sander) made one of the earliest attempts, about 15 years ago. Two things are different this time: the implementation of the idea, and the number of sequences available.