A recent piece of correspondence in Genome Biology (Parkhill J, Birney E, Kersey P: Genomic information infrastructure after the deluge. Genome Biology 2010, 11:402) discusses the fact that the scientific community's ability to maintain well-curated, up-to-date reference genomes is failing in the face of the flood of new sequence information. The authors point out that, while our ability to obtain sequence data has been rapidly increasing — and the ways we use sequence data have been proliferating — there has been no corresponding change in the community's ability to store, organize and interpret these data. As a result, many genomes were annotated once, when they were first submitted to the public databases, and have never been updated: the group that did the sequencing moved on to another challenge, and there has been no organized attempt to curate the information from experiments enabled or informed by the genome sequence.
It’s not a pretty picture, and there is every reason to think that (without intervention) it will only get worse. Parkhill et al. point out that a major problem of the current model is that it is hard for funding agencies to figure out how to interact with the patchwork of existing resources: it’s hard to determine whether a resource would emerge anyway, without the help of a particular agency or grant; it’s hard to know whether a specific resource offers good value for money; and it’s unclear how long-term funding can be accomplished. The problem is exacerbated by the fact that these resources are, and should be, international, and they are therefore exposed to the shifting winds of enthusiasm for science funding from many directions.
The authors propose a three-tier model for database curation which they believe would solve at least part of the problem, namely the need to ensure that bioinformatics resources are kept up to date and organized in an accessible way. The tiers have different functions: collecting experimental data; curating the data around a specific set of organisms (for example, a set of related species); and top-level integration to provide a complete resource covering everything that's available. They argue that funding bodies should explicitly plan to fund inter-tier integration. Their suggested plan looks eminently sensible to me, though I'm no expert on the needs of the community, and I hope that the initiative the authors have taken will lead to a wider discussion on the role of computational datasets in biology. For example, imaging datasets (e.g. those for the Virtual Fish, or images from high-throughput image-based screening) and mass spectrometry datasets are increasingly moving from within-lab resources to community resources — and then there are datasets of direct medical interest, such as CT and MRI image sets. The challenges of organizing and sharing such datasets include the purely physical (e.g. network bandwidth) as well as the scientific (what was learned from this, and how do I share it?). The genomics community may have faced the challenge first, but other communities are not far behind.
The big issue here is that the importance of computers in biology and medicine has been increasing so rapidly that funding agencies, institutions and individual scientists are struggling to keep up. I see that the narrower issue of computational infrastructure has been getting some attention from a subcommittee of the US House of Representatives' Committee on Science and Technology, which is something. What happens as a result, we shall see.