Bio Data Integration

With the announcement last week that LION is planning to sell off its bioinformatics business, including SRS, a lot of labs are or will be looking for new ways to handle biological data integration. For those not familiar with SRS, it is a software package that parses and indexes the majority of the biological data sources that exist today and allows you to search all of them through a standard interface. It can use links between data sets to pull disparate pieces of information together in result sets, and it generally eases a lot of data warehousing problems. SRS is a commercial product but is available for free to academic institutions and non-profits. So the question I would like to pose to the nodalpoint community is: what have you found to work for your data warehousing needs? Read on for more about data integration...

A few public systems that do data integration have surfaced in recent years, such as Atlas, SeqHound, and BioMart, but I haven't seen much use of these systems outside of their home labs yet. BioMart has started to be implemented by a few outside groups because of its affiliation with GMOD, but its adoption is still in the early phases. Have any users worked with these or any other systems to address their data integration needs? I would be interested in hearing from you if you have.

One of the problems I have with these systems is that you are limited to the data sources they choose to integrate. While the set of data sources they package is significant, plugging in my own data sources is something I do fairly often, and thus it ranks pretty high on my feature wish list. After quick reads through the docs of Atlas and SeqHound, it appears that integrating your own data source would take a lot of heavy customization (please correct me if I am wrong). BioMart, on the other hand, does allow you to plug in your own data sources, but it requires that the data be "re-formatted" to fit its data model. This isn't necessarily a bad thing, but it does increase the data management burden compared to systems that are able to deal with the data in its native format.

The other possibility these days is to take advantage of the web services that are being exposed for bioinformatics, like the NCBI's Eutils project. The types of problems you see with this approach include a lack of services for some data sources, poor performance when dealing with large data sets, and implementation differences between services. Performance is the major killer for me, especially for automated analysis pipelines or anything else that requires the best performance you can get.
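To make the web services route and its performance concern a little more concrete, here is a minimal Python sketch that only builds request URLs for the NCBI E-utilities `esearch.fcgi` and `efetch.fcgi` endpoints. The helper names (`esearch_url`, `efetch_url`, `batches`) and the default parameter values are assumptions made for illustration; no request is actually sent.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities web services.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL (hypothetical helper; no network call is made)."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def batches(ids, size):
    """Yield successive chunks of an ID list, so a pipeline can issue
    one request per chunk instead of one request per record."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def efetch_url(db, ids, rettype="fasta", retmode="text"):
    """Build an EFetch URL for a batch of IDs (hypothetical helper)."""
    params = {
        "db": db,
        "id": ",".join(ids),  # several IDs go into one comma-separated value
        "rettype": rettype,
        "retmode": retmode,
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    print(esearch_url("gene", "BRCA1", retmax=5))
    for chunk in batches(["101", "102", "103"], size=2):
        print(efetch_url("nucleotide", chunk))
```

Batching cuts the number of round trips, but every record still travels over the network, which is why a local index can remain attractive for large automated pipelines.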


Comments

Integration is always case-by-case?

In other words biological data and the demands (queries) placed upon biological data are too diverse and too specific for a one-size-fits-all approach to ever work generally.

Exactly.

I've been pondering this thread all week and I'm finding it hard to summarise my thoughts on this very important issue. So here's a random selection of things that have gone through my head.

  • In 5 years as a practising bioinformatician, I have not once found a use for SRS.
  • Thanks to the original poster for some useful links - Atlas looks very promising at first glance.
  • I'm with Greg - data warehousing (keeping a lot of disparate data sources in one place with a common interface) is very different to data integration. When I think of data integration, I think of a process such as annotating a newly-sequenced genome. For each feature of interest, I want to gather various data - its coordinates, sequence, motifs, BLAST hits to various databases, structural data, PubMed references and so on. A nice system for this is a GFF file, stored using Bioperl's Bio::DB::GFF and accessed via Generic Genome Browser. For other types of project, I might want other types of data. In other words, what to integrate, what data to obtain and how to process the data is often project-specific. For me, a lot of the skill in bioinformatics is in knowing where to obtain primary data and devising new and clever ways (i.e. scripts) to process and display data.
  • I like a flat file, myself. PDB, GFF, GenBank, dssp output, whatever. More often than not there's a Bioperl tool for the job - if not, I can easily whip up a parser for plain text. I appreciate that a database schema can help with storage, organisation and searching and have benefited a lot from good schema design (Bio::DB and biosql). Yes it's an overhead, but worth considering if you're doing a lot of similar work over time with similar files.
  • <Greg flamebait> ;-)
    Markup, on the other hand, leaves me cold. For one thing, what about download costs? I'm doubling the size of my file by adding markup to it. For machine-readable data, I much prefer to get a raw file and devise the tools to deal with it at my end - particularly as the PTB seem unable to decide on markup standards. And what if I don't like their markup standards? Yes, give me a flat file any day. I'm a flat file kinda guy.

    </Greg flamebait>
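As a rough illustration of the "whip up a parser for plain text" point in the list above, here is a minimal Python sketch that reads one line of a nine-column, tab-separated GFF file. It assumes GFF3-style `key=value` attributes; the `Feature` tuple and `parse_gff_line` are hypothetical names, and this is not a substitute for Bioperl's Bio::DB::GFF.

```python
from collections import namedtuple

# The nine standard GFF columns, in file order.
Feature = namedtuple(
    "Feature",
    "seqid source type start end score strand phase attributes",
)


def parse_gff_line(line):
    """Parse one GFF line into a Feature; return None for blanks and comments.

    Assumes GFF3-style 'key=value' pairs separated by ';' in column nine.
    """
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    attributes = {}
    for pair in attrs.split(";"):
        pair = pair.strip()
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key] = value
    return Feature(seqid, source, ftype, int(start), int(end),
                   score, strand, phase, attributes)


if __name__ == "__main__":
    # A made-up record, for illustration only.
    example = "chr1\tmylab\tgene\t100\t900\t.\t+\t.\tID=gene0001;Name=geneA"
    print(parse_gff_line(example))
```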


Data warehousing vs. data integration

All the systems you mention are what I would consider data warehousing systems using a kind of one-size-fits-all approach. Having data in the one place and accessible via consistent interfaces is a good thing. However as you have clearly outlined the problem is that these systems are not very good at data integration.

When I think of data integration I think more in terms of taking arbitrary data sources, merging common objects (the identity problem all over again) and then finding new links between those objects via generation of new facts (inference) and query systems.

In other words biological data and the demands (queries) placed upon biological data are too diverse and too specific for a one-size-fits-all approach to ever work generally (I am not dissing the systems you mentioned, they are what they are). More flexible data models and database systems are necessary.

The most promising solution IMO is to use a graph data model to underlie your data integration system. This is in fact the basis of the semantic web, which I'm sure you are aware of given your previous post on LSID identifiers. I'm not aware of any mature publicly available systems that implement this kind of graph data model for integrating biological data. There is some evidence in the literature of mature systems, and the beginnings of such a system implemented by Christopher Lee's group in Python. The semantic web for life sciences is producing some promising discussion on these types of systems.
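To sketch what I mean by a graph data model with merged identities, here is a toy Python triple store. Everything in it is an assumption for illustration (the `TripleStore` class, the `sourceA:`/`sourceB:` identifiers, the predicate names); it is not how any of the systems mentioned here are implemented, and it leaves inference out entirely.

```python
class TripleStore:
    """Toy graph of (subject, predicate, object) facts with identity merging."""

    def __init__(self):
        self.triples = set()
        self.alias = {}  # identifier -> the canonical identifier it was merged into

    def canonical(self, node):
        """Follow alias links until a canonical identifier is reached."""
        while node in self.alias:
            node = self.alias[node]
        return node

    def add(self, subject, predicate, obj):
        """Record a fact, storing both ends under their canonical identifiers."""
        self.triples.add(
            (self.canonical(subject), predicate, self.canonical(obj))
        )

    def same_as(self, a, b):
        """Declare that two identifiers denote the same object and merge them."""
        a, b = self.canonical(a), self.canonical(b)
        if a == b:
            return
        self.alias[b] = a
        # Rewrite existing facts so they all refer to the canonical identifier.
        self.triples = {
            (self.canonical(s), p, self.canonical(o))
            for s, p, o in self.triples
        }

    def match(self, subject=None, predicate=None, obj=None):
        """Return sorted facts matching the given pattern (None is a wildcard)."""
        s = None if subject is None else self.canonical(subject)
        o = None if obj is None else self.canonical(obj)
        return sorted(
            t for t in self.triples
            if (s is None or t[0] == s)
            and (predicate is None or t[1] == predicate)
            and (o is None or t[2] == o)
        )


if __name__ == "__main__":
    store = TripleStore()
    store.add("sourceA:g1", "located_on", "chr1")    # fact from one data source
    store.add("sourceB:x7", "has_motif", "motif42")  # fact from another
    store.same_as("sourceA:g1", "sourceB:x7")        # resolve the identity problem
    print(store.match(subject="sourceB:x7"))
```

Once the two identifiers are merged, a query on either one returns facts that originally came from both sources, which is the kind of link a fixed warehouse schema does not give you for free.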

Another thing to consider, related to the web services solution, is the 'biological microformat'. I'll define microformats as small single-purpose schemas designed to be easy to implement (i.e. easy for a human to understand). The importance of the microformat will not be seen in large bioinformatics shops (EBI, NCBI etc.) but in smaller laboratories. Think of it as all data structured all the time. So each lab produces a constant stream of data using microformats to mark up their results, data etc., and aggregators then go out and collect that data (yes, RSS for science is exactly what I'm talking about). Evidence of microformats: search the journal of bioinformatics for the key words "markup language" and you'll see what I mean.

ATEOTD there is currently no magic that will solve this problem. You're looking at a long hard road to bioinformatics nirvana where all data is available in machine readable/understandable formats...


BioMart for integration

Hi Josh,

While I am not too familiar with Atlas and SeqHound I would like to point out that BioMart is rapidly maturing and not so hard to configure.

For the BioMart schema, data reformatting can be as easy as creating "views" in your Oracle/MySQL schema. You wouldn't have to transform everything in your dataset or, worse, duplicate your data. With respect to analysis pipelines and automated analysis, the BioMart API should be a plus. It could plug into your Bioperl scripts.
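As a rough sketch of the "views" idea, the Python snippet below uses the standard-library `sqlite3` module as a stand-in for Oracle/MySQL. The table, view, and column names are hypothetical and are not taken from the BioMart documentation; the point is only that a view exposes existing rows under a different shape without copying them.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a lab table and a reshaping view."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Existing lab table, kept exactly as the lab already uses it.
        CREATE TABLE lab_genes (
            id        INTEGER PRIMARY KEY,
            symbol    TEXT,
            chrom     TEXT,
            start_pos INTEGER,
            end_pos   INTEGER
        );
        INSERT INTO lab_genes VALUES (1, 'geneA', 'chr1', 100, 900);
        INSERT INTO lab_genes VALUES (2, 'geneB', 'chr2', 50, 400);

        -- A view that renames and derives columns; no rows are duplicated.
        CREATE VIEW genes_main AS
        SELECT id                      AS gene_id_key,
               symbol                  AS gene_symbol,
               chrom                   AS chromosome,
               end_pos - start_pos + 1 AS length
        FROM lab_genes;
        """
    )
    return conn


if __name__ == "__main__":
    conn = build_demo_db()
    query = ("SELECT gene_symbol, chromosome, length "
             "FROM genes_main ORDER BY gene_id_key")
    for row in conn.execute(query):
        print(row)
```

Because the view is computed at query time, updates to the underlying table show up in it immediately, which is what avoids the duplication problem.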


Update on SeqHound

FYI,
Since I posted this story, the people behind SeqHound (Blueprint.org) have announced major layoffs and will close their doors in a few months. I couldn't find any comments about how this affects SeqHound but my assumption is that development will cease. BIND on the other hand is moving to one of their sites in Singapore and will be maintained.


SeqHound continues to serve the community

Hello,

In response to earlier concerns about the future of SeqHound, please let me assure the community that while The Blueprint Initiative retools its Toronto operations and enhances its Singapore operations, SeqHound will remain fully operational and supported, as will the other tools and databases that we currently support, including BIND.

Blueprint North America is going through a period of transition, and we appreciate everyone's continued support. It is our fervent hope and full expectation that the user community will see no disruption in the level of service you have come to expect from Blueprint during this period.

If any issues or questions arise, please do not hesitate to contact us, as always, through the product-specific email addresses or info@blueprint.org.

Thank you,

Randall C Willis
Manager--Communications
The Blueprint Initiative


SRS and BioMart

Hi
I used to run an SRS server in an academic environment and thought it was absolutely essential - possibly one of the most powerful ready-made tools I could think of at the time. I am actually quite hopeful that selling off SRS might bring its price down (not great for the academic world but might become affordable for smaller commercial outfits - and who is to say that it won't remain free to academia). Perhaps a strategy of selling more packages at a reasonable rate could have been more successful.

In any case, I'll investigate BioMart just the same - thanks for the pointer.


RE: BioMart for integration

This is definitely true for data that is already in an RDBMS. However, I have a lot of data that is in flat file format and that I need access to as well. Since other apps rely on the flat file format I would need to duplicate the data to get it in a BioMart DB. This is really what I had in mind when I mentioned data management overhead.

It is clear to me that there are no one-size-fits-all solutions yet, nor anything that comes close to SRS. So for now I plan to continue cobbling together various data access packages like BioMart to achieve the end result I need.

p.s.-Is this the mseewald I know from IU or are you someone different?


You are definitely right. Giv

You are definitely right. Given the current database chaos you will have a significant overhead. But, on the other hand, most things nowadays are Entrez Gene- or Ensembl-centric. So, once you have your data parsed, matching it to the other data is not so hard. And if you build your BioMart DBs around these, many, many things might be quickly integrated.

Michael

PS: Yep, that's me! Good memory!! You can reach me under (AT)web.de. I have been back in Germany since August last year. Before that I worked in Austria for a while.

PPS: Please say Hi to Martin, if you meet him!