On Information Infrastructure in Biotechnology and Medical Research

The present information infrastructure in biotechnology and medical research is quite simply not up to the task at hand - dramatic advances in our ability to generate data have outstripped the development of processes and tools to make that data useful. This problem manifests itself in the form of lost opportunities, duplicated or wasted effort, and missed answers. In other words, progress in medical science is nowhere near as fast as it could be, just using the technologies of today. Fortunately, this truth is no great secret: everyone knows, and resources are being directed into the development of solutions.

On the process side, we have the move towards open access journals and other forms of open publication. The Public Library of Science is shifting its weight into clinical trial data, for example:

Clinical trials - and particularly randomized trials - are critical in delivering reliable evidence about the efficacy of an intervention. Clinical trial data can also provide important information about the potential adverse effects of treatment. Currently, not all trials on human participants are reported in the peer-reviewed literature. PLoS Clinical Trials aims to fill this gap. The journal will broaden the scope of clinical trials reporting by publishing the results of randomized clinical trials in humans from all medical and public health disciplines. Publication decisions will not be affected by the direction of results, size or perceived importance of the trial. As an open-access journal, all articles published in the journal will be immediately and freely available online.

I see the main benefit of open publication strategies as the platform they provide for open-source models of development in automation, tools, and processes of data management within the scientific community. If the data is free, the only cost of making better use of that data is your time ... and we've seen that this situation produces very impressive end results in software development. The closed journals - and their business models - are a roadblock to that sort of progress, and I think that roadblock is becoming a real problem in the information-rich fields of biotechnology and medicine.

On the technology side, a range of tools is under development. The informational side of biotech looks a lot like the open source movement of ten years ago - many competing standards, lots of good ideas, and a real froth of software. An example of the sort of tools I'm talking about is the work of Butte and Kohane:

"Nearly 100 different diseases have been studied using microarrays, spanning all of medicine. This is a new way to explore this type of data. We can study virtually everything that's been studied." Butte is the first author of the study, which is published in the Jan. 6 online issue of Nature Biotechnology.

The advance comes with a caveat, however: clinically useful nuggets will be buried under the avalanche of data inundating international repositories each year unless scientists come up with a way to better classify their experiments and results.

"Libraries figured out a long time ago how to classify items using the Dewey decimal and other systems," said Butte, who estimates that the contents of the databases are more than doubling each year. "We need to write software now that will help scientists assign the proper concepts to each experiment."


Butte and his Harvard co-author, Isaac Kohane, MD, PhD, used computer programs to automatically categorize the tens of thousands of microarray experiments in a single database based on the terms, or concepts, used by the submitter to describe the experiment. They then looked for findings shared by several experiments with similar concepts, such as tissue type, for example. Comparing results from many similar experiments allowed them to identify correlations that may not be statistically significant in just one experiment.
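To make the idea of concept-based categorization concrete, here is a minimal sketch in Python. It is not Butte and Kohane's software: the vocabulary, the experiment descriptions, and the function names (`tag_concepts`, `group_by_concept`) are all hypothetical, and simple substring matching stands in for whatever concept mapping the authors actually used. It only illustrates the general shape of the approach: turn each submitter's free-text description into a set of concepts, then group experiments that share a concept so their results can be compared.

```python
from collections import defaultdict

# Hypothetical controlled vocabulary: concept -> trigger phrases.
# A real system would draw on a curated medical vocabulary instead.
CONCEPTS = {
    "muscle": ("muscle", "myocyte"),
    "liver": ("liver", "hepatic", "hepatocyte"),
    "diabetes": ("diabetes", "diabetic"),
}


def tag_concepts(description):
    """Return the set of concepts whose trigger phrases appear in the text."""
    text = description.lower()
    return {
        concept
        for concept, phrases in CONCEPTS.items()
        if any(phrase in text for phrase in phrases)
    }


def group_by_concept(experiments):
    """Map each concept to the sorted IDs of the experiments tagged with it."""
    groups = defaultdict(list)
    for exp_id, description in experiments.items():
        for concept in tag_concepts(description):
            groups[concept].append(exp_id)
    return {concept: sorted(ids) for concept, ids in groups.items()}


# Made-up submitter descriptions, for illustration only.
EXPERIMENTS = {
    "exp1": "Skeletal muscle biopsies from diabetic patients",
    "exp2": "Hepatic gene expression after fasting",
    "exp3": "Liver tissue in a mouse model of diabetes",
}

groups = group_by_concept(EXPERIMENTS)
for concept in sorted(groups):
    print(concept, groups[concept])
```

Substring matching is crude (a phrase like "liver" would also match "deliver"), which is one reason the classification problem described in the quoted article is harder than it looks.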

If progress in biotechnology and medicine is to continue at the present healthy pace - especially in very complex problems that span many comparatively isolated fields, such as addressing age-related degeneration - then the research community must successfully deal with the problem of data management.


My question is this: where does the need for better information technology (software) stop being about the specialized tools these researchers specifically need, and start being about software that is generally useful across disciplines, so that someone in the general open source movement can understand it and be motivated to hack on it by its usefulness to himself or herself?

Or is the problem that the technology (software) does exist, but every biotech lab in the world is reinventing the wheel, writing the exact same software instead of sharing it?

I like to hope that there's software that is useful to medical research but also to regular small enterprises or computer science folks. Improving open source databases like PostgreSQL might fall under this category.

Posted by: tomo at January 17th, 2006 7:26 PM