Database or flat text file?

As the deadline for my Ph.D. thesis looms over me, I've been working recently with a certain quantity of data of the same type (basically a variant of microarray gene expression) but from different sources and/or software.
Since I'm also doing a small portion of meta-analysis I tried to find a way to see common points or differences in the data: at first I parsed the CSV reports of the tools I used, but I quickly found it to be unbearable at best, so I settled to make a SQLite database in order to query appropriately the content, though creating and populating the database correctly was certainly an interesting experience, although it's paying off.

What about you? When do you think it's appropriate just to use flat files and when to use a small database? Or is using databases for small projects too much of an effort?


Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

For splicing, python dictionaries could be better than databases

Read this document:
- http://bioinfo.mbi.ucla.edu/pygr_0_5_0/seq-align.html#SECTION00011000000...

they say that databases could be complicated to use if you work with splicing or alternative splicing, or at least that you should look for some libraries.

Bye :)
--
http://genome.imim.es/~giovanni


Database or flat file? Could try XML

Its horses for courses, as people have said. If its a one-off hack, that nobody (including you) will ever want to re-run or re-use, flat files are the thing. However, this is rarely the case. If you don't want to go through all the hassle of learning about relational databases, you could try XML. It doesn't force you to come up with complex schemas from the beginning, but gives you powerful mechanisms for querying and manipulating your data (XPath, XQuery and XSLT). You can still have simple flat-files of course, but you won't have to write your own parsers, since there are tons of XML parsers out there to choose from. Just my $0.02 ... hope it helps.


XML is the hit!

However when you have a huge XML file X-Path could not be working as it needs to parse the entire document into tree structure for querying, and it could be slow. For a reasonably medium sized dataset, XML is a good substitution for database.


I also thought about XML,

I also thought about XML, but I found it quite complex, mostly because of the learning curve involved (which looked steep at the time I investigated). I tried to write something simple to parse Entrez Gene records (as a means to get a common ground for my datasets) in Python back then but I had to admit defeat due to lack of knowledge.


Try X-Path it is pretty

Try X-Path it is pretty straightforward. You need to use libraries. Perl and Python i believe they all have X-path libraries.


Databases or XML? Invest in your future...

There is quite a steep learning curve with both relational databases and XML. However, whichever you finally choose, they are both worth learning and will pay dividends in the long run.


It does indeed depend

Greg has it about covered. It's appropriate to use a database when it's inappropriate to use a flat file. And vice versa :)

If you don't need (a) structure and (b) queries for the data in your current project, then stick with flat file for the sake of speed. However, it's well worth learning databases at some stage in your career. The best way to learn new stuff is to apply it to one of your projects, rather than just saying "well, I'll teach myself one day when I have a moment". You'll never have a moment and trying to learn something in an abstract sense is much harder than in a practical sense, where you can see the relevance to your data.


There are three stages to

There are three stages to this. First you start out with a tab delimited text file. This is great for parsing using scripting languages, and writing simple queries, again using scripting languages. The problem is that at some point your intuition tells you that you should be doing more to structure your data, find relationships in your data etc. This then leads you to the never ending quest for the right database and the right schema. It is easy to dismiss relational databases at this point, when schema design and database choice side track you from your real goal.

My suggestion is to move on to stage two, use a relational database, but only for the added power that SQL gives you when querying a single table. It doesn't matter which one you use, MySQL is quite straight forward, and you don't need a complex schema. Just dump your existing flat file into a simple table that models the existing data types. Now you can use your scripting language's MySQL integration to pull out rows, query the table using conditions and limits etc. without having to write ad hoc query code in you analysis scripts.

If you think you can avoid the trap of finding the right schema/database, then moving on to stage three, and trying to model relationships in your data might be worth it. The setup time will cost you, but if you do it right, then you will have the full power of SQL to do your data mining. Some kind of visual modeling language can be helpful (e.g. UML) but relational database modeling is also an 'end in itself', try to avoid it at all costs, and focus on getting the best scientific value from your data.


Thanks for the suggestions.

Thanks for the suggestions. I thought about a database exactly because I had to infer relationships between my dataset and some others I obtained from the literature, and some of those relationships weren't exactly straightforward. I am into stage two at the moment, using SQLite as I don't want to pollute our MySQL database (we have an internal server with many dumps from the UCSC Genome Browser and other things).

I had to put things on hold right now (the deadline for the summary of the thesis is next month), but I'll resume soon. I'll give a look to UML for sure and decide if it's worth implementing something more complex.


database or flat file - it depends

It depends on what are your skills with computers and databases.

Actually I'm using flat files, because of three reasons:
- I don't have too much data and most of the files I have are fasta sequences of gffs;
- I don't know very well how to work with databases, I think that sometimes it's easier to use grep/perl scripts from the unix command line instead of interrogating the database every time.
- I'm too lazy to ask the systems managers to create a database for me, and I don't want to bring my laptop every day at work :(

Anyway I believe that if you know how to handle data in a database you should do it, read for example this post:
- http://www.bioinformaticszen.com/2007/02/bioinformatics-use-a-database-f...

---

http://dalloliogm.wordpress.com