Integrating BioPAX Compliant Pathway Data

I have been saying for some time now that RDF-compliant data formats will lead to effortless integration of biological resources at the level of the data model. I even presented a poster at ISMB last year along these lines. Of course, I presented it at the wrong session and nobody gave it a second glance. However, in the back of my mind I knew people would get it eventually.

Well, it seems that time is now. A group of clever individuals from Stanford have aggregated BioPAX-compliant data from KEGG, EcoCyc and Reactome and built, surprise surprise, a pathway knowledge base. It is too early to tell whether this is the first step along the path to a semantic web for the life sciences. However, it is a step in the right direction.

A short note on the use of 'aggregated' rather than 'integrated' in the previous paragraph. On the project's web page the authors use 'data integration' rather than 'data aggregation' to describe the work they have done. The term 'data integration' is usually used synonymously with 'heterogeneous data integration', so I don't really see what they have done as 'integration' in that sense, since all the data was already in the same format (BioPAX). Regardless, they are using RDF to do this, and standards are a good thing (TM). While I am not completely convinced that the W3C semantic web recommendations are perfect, I would prefer to see people work with them rather than continue to invent self-contained systems with little added value.
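To make the distinction concrete, here is a minimal sketch of what I mean by aggregation. It assumes each source is already expressed as RDF-style triples in one shared vocabulary. The identifiers (`ex:pathway1`, `bp:name` and so on) are made-up placeholders, not real KEGG or Reactome records, and the helper functions `aggregate` and `match` are hypothetical names of my own, not anything the Stanford group has published.

```python
# Illustrative only: triples are plain (subject, predicate, object) tuples,
# and every identifier below is a placeholder, not real pathway data.
from typing import Iterable, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]


def aggregate(*graphs: Iterable[Triple]) -> Set[Triple]:
    """Merge graphs that already share one vocabulary: a plain set union."""
    merged: Set[Triple] = set()
    for graph in graphs:
        merged |= set(graph)
    return merged


def match(graph: Set[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> List[Triple]:
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(t for t in graph
                  if (s is None or t[0] == s)
                  and (p is None or t[1] == p)
                  and (o is None or t[2] == o))


source_a = {
    ("ex:pathway1", "rdf:type", "bp:Pathway"),
    ("ex:pathway1", "bp:name", "Glycolysis"),
}
source_b = {
    ("ex:pathway1", "rdf:type", "bp:Pathway"),  # identical statement, collapsed by the union
    ("ex:pathway2", "rdf:type", "bp:Pathway"),
    ("ex:pathway2", "bp:name", "TCA cycle"),
}

merged = aggregate(source_a, source_b)
pathways = [t[0] for t in match(merged, p="rdf:type", o="bp:Pathway")]
print(pathways)
```

Because both sources use the same terms, the merge is nothing more than a union and a single query spans both. Had the two sources used different identifiers or predicates for the same pathway, the union would still succeed but the records would not line up, and that reconciliation is the part I would call integration.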

It will be interesting to see the paper once it emerges, to see if they had to deal with any semantic heterogeneity. Look out for personal semantic integration desktop software in the future.


Comments

Hi Greg

I like your optimism, but I feel I have to point out that RDF is no magic bullet. Yes, it would be nice to do SPARQL queries on any kind of data source, but the trivial syntactic transformation to RDF is only about 1% of the work in data integration. At the very least you need high-quality orthogonal ontologies covering your domains of interest at the relevant granularity. Even then there are significant challenges ahead. Does your data represent real-world instances or classes, or knowledge about these things? Have you thought about open vs closed world, missing data, and data about absence?

There's a nice paper by Alan Ruttenberg of BioPAX presented at the Galway OWL meeting that is scratching away at the tip of this iceberg.

And let's not forget time, which is impossible to do right and retain normal RDFS and OWL semantics.

I have to disagree with you about Amarnath Gupta's work. I don't think this is self-contained or has little value. On the contrary, we need people like Gupta who can see beyond the semantic web hype and produce serious working systems, as opposed to toy FOAF applications (which are nice, but only present a minute fraction of the challenge in representing complex biological systems).

I'd be happy to be proved wrong and see evidence to the contrary.

Cheers
Chris


I am optimistic about RDF

I am optimistic about RDF and OWL not because I feel they can solve the problems you mention, but because they are open languages backed by the W3C and have a fighting chance of being widely adopted. What the work of the Stanford people showed me was that this might actually be true.

With regard to Gupta's work, I'm not taking issue with its academic merit. I'm sure the software is great for storing biochemical pathways. It just doesn't add value for the adoption of RDF and OWL (it implements its own graph representations, query languages etc.). Furthermore, it is just not that interesting, since very similar work has been around for a while.