Roundup: Extract a sequence from a fasta file

HMMs, SVMs, MCMC - interesting topics! But I have simple problems. Here is one: How do I extract some sequences from a fasta file if my a accession number are not Genbank ids themselved but other words that are still in the header?

Materials:
Hm. Pubmed won't export so many sequences from the web interface (at least I could not find a way, limit is 100) If I was a biologist, I would probably repeat the process manually 70 times to get 70 * 100 sequences. Which might have actually saved me a lot of time. But I wanted to be clever.

[ We also welcome here the Google users that just switched in and found this page in 2012 ]

Everything is fine if you just had a rather small fasta file and you knew the the accession ids you want. You can use seqret (EMBOSS) to get them from directly from GenBank or bp_fetch (BioPerl).

It's getting more complicated if you're searching not for ids but for parts of the description as in our case. You can try faFilter from the UCSC source tree but it's too slow for big FASTA files. I'm not the only one who has this problem, searching in mailing lists reveal that Malcolm Cook suggested a little Perl script to do exactly that.

But I have a big fasta file from Genbank, one gigabyte. I can index it first with formatdb (NCBI BLAST) and pick out the sequences with fastacmd or use BioPerl, index it with bp_index.pl and use bp_fetch.pl to get the sequences. But this still only searches the first word of the header, not the others.

So, welcome to the great world of EMBOSS. No simple commands anymore, USA is something else, man is not man but tfm ("what the f***?" you might think), and apropos: that's wossname. And yes, the N in the config file has to be really an N not an "n". (ahhh!) You can play around and add your fasta file with emboss.defaults, index your fasta file with dbifasta, try textsearch to get your sequences, waste half a day, just to find out that this is still much too slow. Hm. In another post in 2007, Malcolm (again, different mailing list, different year, a coincidence?) seems to have forgotten his script but suggests Lucegene which looks far too complex for the problem.

OK, another day, a better idea: Extract all headers with "grep '>' filesname", pipe this through sed to remove everything that doesn't look like some sort of accession id and reorder the two columns with gawk. This will finally create a list (IdInDescription, GenbankAccessionId ). Then I used join to translate my original list of IdInDescriptions to GenbankAccessionIds and could finally use Emboss to get my sequences. Malcolm's original script would have probably worked just as well.

Conclusions: Emboss is from a different century. I stick with my old friends grep/sed/gawk/sort/join or learn more perl. And bioinformatics is a rather desperate business where just getting sequences can easily take two days.

[ I am looking forward to mails from Google users in 2012. Please tell me if you found this page useful. ]


Comments

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

A couple of points

First, I don't believe there is a limit to the number of sequences that NCBI Entrez will export. I've retrieved tens of thousands in the past. So far as I remember, it's a case of choosing "save to file" from a set of search results.

Second, sequence retrieval is certainly harder than it should be. But you need to ask yourself if your initial approach is a good one. Do you really need to start with a 1 GB file or can you cut it down using some pre-filter? And sequence retrieval using descriptions is never a good idea. The fasta header contains 2 components: an ID (anything between the > and the first white space) and a description (the rest). The description is freeform. It can be whatever people want, or nothing at all, or a totally misleading annotation. This is why we have unique identifiers. Can you use some feature of the sequence itself for retrieval? Length, pI, presence of a domain detected using HMMER versus Pfam? Bioperl is very good at that kind of search and retrieval.


some points

Thanks Neil. I think there is no limit to the number of sequences you can download but a limit to the number of keywords you can search for. I did "id1 or id2 or id3 ..." and hit the limit there. There should be probably a better way to interface with NCBI but I don't know it.

Why 1 GB? Most EST-projects don't give their sequences genbank-compatible ids. They start with something homegrown, of course, then put the clones onto plates and distribute them. This is similar for mouse clones, zebrafish clones and also in my case ascidian cDNA clones. You see, a word in the description can be a real ID, if a sequence historically always had 2 different IDs. My EST sequences were just selected sequences, there is no way to filter from the sequence.

I think Badboys is right in that very often plain old Perl is more than enough. With hindsight, I could have hacked a python/perl script together in 20 minutes. It's so simple: Slurp the ids into a hash, iterate over the lines of the fasta and output everything that has a fasta id that contains a word in the hash. My problem was that I assumed that there was already a solution for this.

This happends to me all the time: I find programs on the internet that claim to do something and end up hacking something together 3 days later because the tools a) don't come with source code b) won't run at all c) do something slightly different which compromises the whole application for which I need them for.


simple solution

Let's get back to basics on this question. From what I've read above it seems as though we're not quite there yet when it comes to doing simple tasks on very large datasets. The bulk of what we use with bioperl, EMBOSS, etc, work great with 10-,100- thousands of sequences at most(depending on their sequence length ofcourse).

If you're trying to find sequences based on matching headers then a very simple perl script should do the trick. No bioperl and perhaps a pipe or to/from grep should get it done. Writing something that runs through the input file no more than once will also help. Something I find quite useful with gawk and perl are the ability to specify the input record separator, "$/" , and assigning it to something logical like "//" for Genbank entries.