Fold your favorite protein in under an hour
December 9, 2011
The protein folding problem, as it’s called, has been confounding biologists for decades. Unlike a strand of RNA or DNA, which can be relied upon to follow a few rather simple rules dominated by base pairing, a string of amino acids seems to have so many possible ways to interact with itself as to defy analysis. But the “problem” isn’t a problem for the protein — proteins fold, for the most part, rather efficiently, even in vitro where they have no help from the rest of the contents of the cell. So the information on what the fold should look like is sitting there in the sequence, and the question is how to read the code.
The protein folding field has tried many routes to translate the information in protein sequence into three-dimensional structure, ranging from head-on attempts to use physics plus supercomputers to work out a protein fold from first principles, to efforts to harness the vast problem-solving potential of online gamers by translating the rules of folding into a game, to attempts to “cheat” by using evolutionary conservation to get an idea of which parts of the sequence are essential for the shape of the fold. There’s been progress on all fronts, and it’s now possible to use computational methods of various kinds, often informed by existing structural information, to fold small proteins. Although you need a fair amount of computer power to be successful, and better models for the complicated forces affecting a biomolecule in water are clearly needed, there’s a feeling that if we keep chewing away at the problem we will eventually be able to solve it.
A new paper (Marks et al. 2011 Protein 3D structure computed from evolutionary sequence variation. PLoS ONE doi:10.1371/journal.pone.0028766) now provides a rather startling step forward that dramatically reduces the need for major computational resources. You can now fold a ~250 amino acid protein on your ordinary laptop. The one apparent catch is that you can’t do this with just any sequence: you need a fairly large family of homologous sequences, of around 1,000 family members. Information derived from this family of sequences about changes in one part of the protein that correlate with changes in another part of the protein (covariance) is used to infer how close the two parts of the protein are to each other. This reduces the “conformational search space”, the number of three-dimensional folds you have to evaluate before settling on the best one, and that in turn not only speeds up the process of sorting through the possibilities, but also increases the chance that you will find the right answer.
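To make "covariance" concrete, here is a minimal sketch that scores how strongly two columns of a sequence alignment vary together, using mutual information. Mutual information is just one common covariation score, chosen here for illustration; the four-sequence alignment and the function name are made up, and nothing below is the paper's actual code.

```python
import math
from collections import Counter

def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_i)
    count_i = Counter(col_i)            # residue counts at position i
    count_j = Counter(col_j)            # residue counts at position j
    count_ij = Counter(zip(col_i, col_j))  # counts of residue pairs
    mi = 0.0
    for (a, b), n_ab in count_ij.items():
        p_ab = n_ab / n
        # p_ab / (p_a * p_b), written with raw counts
        mi += p_ab * math.log2(p_ab * n * n / (count_i[a] * count_j[b]))
    return mi

# Hypothetical toy alignment: one row per family member, one column per position.
alignment = ["AKL", "AKV", "GRL", "GRV"]
columns = list(zip(*alignment))

# Positions 0 and 1 always change together (A with K, G with R).
print(mutual_information(columns[0], columns[1]))  # 1.0
# Position 2 varies independently of position 0.
print(mutual_information(columns[0], columns[2]))  # 0.0
```

A real family would have hundreds or thousands of rows, and a raw pairwise score like this one cannot tell a physical contact from a correlation that arises for other reasons.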
Now, this is far from a new idea. It has been tried and tried and tried for years and has always failed. In fact one of the authors (Chris Sander) made one of the earliest attempts, about 15 years ago. Two things are different this time: the implementation of the idea, and the number of sequences available.
The big problem with using evolutionary information to identify protein contacts has been that there are lots of reasons why two parts of a protein may co-vary, and there’s been no way to sort out which correlations are due to physical contacts and which are due to confounding factors — for example, selection for binding to a target protein or a substrate. Perhaps the most serious confounding problem is that indirect contacts can lead to covariance. If you have a situation where both A and C are in direct contact with B, then if B changes to B’, A and C may each have to change (to A’ and C’) in order to maintain protein function. This looks exactly as if A and C are in direct contact, but they’re not. The real contacts are a subset of the apparent contacts. The key advance in this paper is the use of a new approach to identify real contacts. Essentially, Marks et al. ask: can I find a small set of contacts that explain all the other contacts? In the ABC example above, there are three apparent contacts: AB, BC and AC. Any two of these would explain all three. In the case of a protein with 100 amino acids, there might be 600 apparent contacts. Using a sophisticated maximum entropy approach, Marks et al. show that a surprisingly small subset — approximately 50-75 — of these 600 are sufficient to explain all the rest.
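The A, B, C situation can be illustrated with a deliberately simplified numerical analogue. The sketch below uses partial correlation on three continuous variables, which is a stand-in and not the maximum entropy model of Marks et al.; the correlation values are invented. It shows the same logic, though: once B is accounted for, the apparent A-C link disappears while the A-B link survives.

```python
import math

def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation between x and y after removing the linear effect of z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical pairwise correlations: A and C each track B closely.
r_ab = 0.8
r_bc = 0.8
# A and C never touch; their correlation arises only through B.
r_ac = r_ab * r_bc

# Apparent A-C coupling once B is accounted for.
direct_ac = partial_correlation(r_ac, r_ab, r_bc)
# A-B coupling once C is accounted for.
direct_ab = partial_correlation(r_ab, r_ac, r_bc)

print(direct_ac)            # 0.0: the A-C "contact" is explained away
print(round(direct_ab, 2))  # about 0.62: the A-B contact remains
```

The raw A-C correlation of 0.64 looks like a strong signal on its own. Only a calculation that considers all the pairs together reveals that it carries no independent information.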
The next leap is to treat each of these ~75 contacts as real — constraining the structure — and ask whether these constraints allow you to find an energetically reasonable protein fold. If so, we can next ask whether this fold bears any resemblance to real structures. Up until now we’ve only been talking about information from a family of proteins, but at this point the authors selected specific individual family members and simulated their folding. So the workflow goes like this:
• compare family of protein sequences to identify list of possible contacts
• use maximum entropy to rank possible contacts in order of probability of being real
• choose an individual protein to attempt to fold
• use varying numbers of the contacts from the family analysis as distance geometry constraints (cf. NMR methods for structure determination), producing a set of possible structures
• rank the possible structures according to criteria that all proteins have to meet, e.g. the twist of an alpha helix must always be right-handed
• optimize the top-ranked structure of this individual protein by energy minimization
• determine how close the resulting predicted structure is to the published structure of that protein
Just to hammer home the point, this final step is the first time pre-existing structural information is used. The folds are predicted entirely from sequence data.
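One reason the ranking step in the workflow above is needed: distances between residues are identical in a structure and its mirror image, so distance constraints alone cannot tell a right-handed helix from a left-handed one. The sketch below is a hypothetical handedness check, not the authors' scoring code. It builds an idealized alpha helix from approximate textbook parameters (100 degrees and 1.5 Å per residue, 2.3 Å radius) and uses the sign of a scalar triple product to tell it from its mirror image.

```python
import math

def subtract(p, q):
    """Vector from point q to point p."""
    return tuple(a - b for a, b in zip(p, q))

def triple_product(a, b, c):
    """Scalar triple product a . (b x c)."""
    cross = (b[1] * c[2] - b[2] * c[1],
             b[2] * c[0] - b[0] * c[2],
             b[0] * c[1] - b[1] * c[0])
    return sum(x * y for x, y in zip(a, cross))

def handedness(p0, p1, p2, p3):
    """Return +1 if four consecutive points twist right-handed, else -1."""
    b1, b2, b3 = subtract(p1, p0), subtract(p2, p1), subtract(p3, p2)
    return 1 if triple_product(b1, b2, b3) > 0 else -1

def ideal_helix(n_residues, radius=2.3, rise=1.5, turn_degrees=100.0):
    """Idealized right-handed alpha-carbon trace (approximate geometry)."""
    step = math.radians(turn_degrees)
    return [(radius * math.cos(i * step), radius * math.sin(i * step), rise * i)
            for i in range(n_residues)]

helix = ideal_helix(4)
mirror = [(-x, y, z) for x, y, z in helix]  # reflect through the y-z plane

print(handedness(*helix))   # 1
print(handedness(*mirror))  # -1

# Every pairwise distance is unchanged by the reflection.
same = all(abs(math.dist(helix[i], helix[j]) - math.dist(mirror[i], mirror[j])) < 1e-9
           for i in range(4) for j in range(i + 1, 4))
print(same)  # True
```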
The authors show structure predictions for proteins from 15 families, ranging from a 71-amino-acid RNA-binding protein to the much larger trypsin (223 amino acids), to a membrane protein, rhodopsin. In most cases the deviation in atomic positions of the top-ranked structure from the crystal structure is less than 5 Å RMSD (this refers to the root-mean-square deviation of the positions of the carbons in the main chain of the protein), and in some cases it’s around 3 Å. For a moderate-sized protein, the whole process takes under an hour on a normal laptop. This is dramatically faster than previous methods; for example, the Rosetta@home de novo prediction algorithm, which runs part-time on a virtual cloud of ~150,000 personal computers (volunteered by their owners), takes weeks or months to come up with quite small structures. Using evolutionary information correctly thus has the potential to make structure prediction enormously faster, and perhaps to complement experimentally-based methods such as crystallography and NMR. (Synthetic biologists may also be interested in the potential of this technique for predicting novel structures.) One of the most impressive aspects of the accuracy of the predictions is how good the predicted structures of enzyme active sites are: the trypsin active site is almost indistinguishable from the real thing.
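For readers who haven't met RMSD before, here is a minimal sketch of the calculation. It assumes the two structures have already been superimposed and that atoms are listed in the same order (a real comparison would first find the best superposition); the coordinates are invented for illustration.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z)."""
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of atoms")
    squared = sum((a - b) ** 2
                  for p, q in zip(coords_a, coords_b)
                  for a, b in zip(p, q))
    return math.sqrt(squared / len(coords_a))

# Hypothetical coordinates in angstroms: three atoms in a row...
crystal = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
# ...and a "prediction" in which every atom is displaced by 3 Å along y.
predicted = [(x, y + 3.0, z) for x, y, z in crystal]

print(rmsd(crystal, predicted))  # 3.0
```

An RMSD of 3 Å therefore means that atoms are, in a root-mean-square sense, about 3 Å away from where the crystal structure puts them.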
All of this is very impressive, and I’m sure you’re itching to get started with your own favorite protein (YOFP). Marks et al. have offered to provide code to anyone interested, and plan to get a user-friendly website up and running within the next few months. But wait — before you get too excited, think about that one tiny snag. How many sequences are you going to be able to find for YOFP family members? How many do you need?
Helpfully, Marks et al. performed an analysis of how the accuracy of their predictions depends on the number of sequences in the database, and how the number of sequences available for each family has changed over the last 12 years. (See figure: average RMSD shown in blue, number of sequences in red). The accuracy/number of sequences tradeoff is not the same for every protein, of course (larger proteins, and proteins with lower alpha-helical content, tend to require more sequences for decent accuracy), but there’s a clear correlation, and the number of family members you need always seems to be in the thousands. If the authors had tried this exact same method in 2001, when the number of protein family members in the database was much lower, their results would have been far less impressive (7-10 Å RMSD instead of 3-5 Å) and they might well have given up. If the protein family of YOFP isn’t large enough yet, don’t despair: at the rate that genomes are being sequenced, you probably won’t have to wait long.
For the purposes of truth in advertising, I should probably note that the main authors of this paper are all friends or acquaintances of mine, and Debbie Marks, the co-first author, is the person I turn to most often when looking for a late-night dinner companion after leaving work. So, don’t take my word for it that this is a cool paper; by all means read it yourself and make your own judgment.
Marks, D., Colwell, L., Sheridan, R., Hopf, T., Pagnani, A., Zecchina, R., & Sander, C. (2011). Protein 3D Structure Computed from Evolutionary Sequence Variation. PLoS ONE, 6(12). DOI: 10.1371/journal.pone.0028766