Partial assembly from Traces

Hi, I have a question and I don't know of a generic bioinfo mailing list where I could ask it: is there a program that does _partial_ assembly from shotgun sequences like those in trace files (e.g. traces.ensembl.org)? As a biologist, you often need only the part of a genome that you're interested in. You have some similar sequence, the damn organism has even been sequenced, but it is not assembled yet. Now, how the heck can you get a long sequence out of the trace archives?

OK, I am about to simply run BLAST, collect all the sequences that I get, clean out vector contamination and assemble them with CAP3. With the resulting longer sequence, I can run BLAST again, clean, assemble, etc... and iteratively extend my sequence in both directions, finally getting my complete gene of interest from the trace archives. Without the need for a supercomputer. OK, now I just have to hack together this script.
Does it already exist? I guess I am not the only one who is hoping that the trace archives might come to the rescue....
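
The loop described above (search, clean, assemble, search again) can be sketched in a few lines. This is only a minimal sketch, not an existing tool: `extend_iteratively`, `find_reads` and `assemble` are hypothetical names, and the two callables stand in for whatever wraps BLAST plus vector cleaning and CAP3 on your system.

```python
from typing import Callable, Iterable, List


def extend_iteratively(
    seed: str,
    find_reads: Callable[[str], Iterable[str]],
    assemble: Callable[[List[str]], str],
    max_rounds: int = 20,
) -> str:
    """Grow `seed` until a round adds no sequence or max_rounds is reached.

    find_reads: hypothetical stand-in for "BLAST the contig against the
                traces and return the cleaned hits".
    assemble:   hypothetical stand-in for "assemble contig + hits (e.g. with
                CAP3) and return the contig containing the gene".
    """
    contig = seed
    for _ in range(max_rounds):
        reads = list(find_reads(contig))
        if not reads:
            break  # nothing matched, so there is nothing to extend with
        candidate = assemble([contig] + reads)
        if len(candidate) <= len(contig):
            break  # this round did not extend the contig at either end
        contig = candidate
    return contig
```

Stopping when the contig no longer grows keeps the loop from running forever once the ends reach a coverage gap. The `max_rounds` cap is a safety net for repeats, where every round may keep pulling in unrelated reads.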


Comments

partial targeted assembly from trace archive

Hi there,

Have you tried using tracembler, available at . I am also trying to achieve the same goal as you and have written a very crude Perl program which does the job but takes a lot of time. Most of that time goes into reading the database file to extract the sequences; I am planning to use fastacmd for that step. Let me know if you want a copy of the script that I have right now. It will require a few modifications but I can help you with that.
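
To illustrate why extraction can be slow and how an index helps: scanning the whole FASTA file for every wanted ID costs one full read per lookup, while a one-off table of byte offsets lets each later lookup seek directly to the record. The sketch below is not fastacmd and not the Perl script mentioned above; `build_fasta_index` and `fetch_record` are hypothetical helper names, and it assumes a plain uncompressed FASTA file.

```python
from typing import Dict


def build_fasta_index(path: str) -> Dict[str, int]:
    """Read the file once; map each record ID to the offset of its header."""
    index: Dict[str, int] = {}
    with open(path, "rb") as handle:
        offset = handle.tell()
        line = handle.readline()
        while line:
            if line.startswith(b">"):
                fields = line[1:].split()
                if fields:
                    # The ID is the first word after '>'.
                    index[fields[0].decode()] = offset
            offset = handle.tell()
            line = handle.readline()
    return index


def fetch_record(path: str, index: Dict[str, int], seq_id: str) -> str:
    """Seek to one record and return it as FASTA text."""
    with open(path, "rb") as handle:
        handle.seek(index[seq_id])
        lines = [handle.readline()]  # the header line
        for line in handle:
            if line.startswith(b">"):
                break  # next record starts here
            lines.append(line)
    return b"".join(lines).decode()
```

The index is built once per database file and could be saved to disk, so repeated rounds of extraction do not pay for a full scan each time.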


you may not need a supercomputer

Thanks for the Genotrace link - it does look useful.

Initially I wasn't sure that I understood your problem but I see your point - you only require perhaps a gene and some flanking sequence.

I guess the usual assembly tools (phrap, cap3, tigr assembler and so on) are not ideal for this job, as the whole notion of "partial assembly" is a rather unusual one. There's a TIGR tool called tgicl used for clustering and assembling a set of DNA sequences, which may be useful.

Really, your problem comes down to selecting, from the archive, the subset of traces that you believe are similar to your gene. You'll need to get all the traces, convert them to FASTA format, then use some rapid alignment method (maybe even BLAST) to see which traces match your sequence of interest. Then you'll need some tool to get just those traces - if you created a BLAST database, fastacmd is a quick way to extract sequences using their IDs, or a BioPerl-based script could be another way.
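
As a rough illustration of the selection step, here is a minimal sketch that takes BLAST tabular output (the standard 12-column format, where column 2 is the subject ID and column 11 the e-value), collects the IDs of matching traces, and keeps only those records from a FASTA stream. The function names and the e-value cutoff are assumptions made for the example, not part of any existing tool.

```python
from typing import Iterable, Iterator, Set, Tuple


def hit_ids(tabular_lines: Iterable[str], max_evalue: float = 1e-10) -> Set[str]:
    """Collect subject (trace) IDs whose e-value passes the cutoff."""
    ids: Set[str] = set()
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.rstrip("\n").split("\t")
        if float(cols[10]) <= max_evalue:
            ids.add(cols[1])
    return ids


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def select_records(
    fasta_lines: Iterable[str], wanted: Set[str]
) -> Iterator[Tuple[str, str]]:
    """Keep only records whose ID (first word of the header) is wanted."""
    for header, seq in read_fasta(fasta_lines):
        fields = header.split()
        if fields and fields[0] in wanted:
            yield header, seq
```

Both inputs are treated as line iterables, so open file handles can be passed in directly and a large trace file is streamed rather than loaded into memory.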

Often in bioinformatics, there is not one piece of software available for a task - you have to build a solution yourself, which is why some scripting ability is such a useful skill.

Finally, don't discount complete assembly. You may not need vast hardware resources, depending on how many traces you have and the repeat complexity, and you may learn something interesting from the assembly. I'm currently running phrap assemblies on a microbial genome using 45 000 traces and it takes less than an hour on a very modest machine - a 2.66 GHz Celeron with 1 GB RAM.


Assembly and Supercomputer

Hi, thanks for the link to genotrace, and thanks for putting this on the front page. I wrote to the authors of genotrace. I simply should have used the keywords "local assembly" instead of "partial assembly" and I would have found it in PubMed myself... sigh...

(keywords, keywords, everything is decided by keywords these days...)

Neil, do you have any idea how much computing power I'd need for assembling not a microbial genome with 45 000 sequences but something 10 times bigger (Ciona intestinalis)? Even if it takes like 100 times longer, that's still not very much...


hard to say in advance

It's hard to predict hardware requirements in advance for an assembly. If by 10 times bigger you mean 10 times more traces (450 000), you'd need a lot of RAM. Assuming a trace converts to a FASTA record of about 1 kb of sequence (so roughly one kilobyte of text), you have to fit that plus various intermediate stage files and output files into available memory. On the other hand, if you mean that the genome itself is 10x larger than an average microbial genome, that may be less important. The amount of repetitive sequence is also a factor - I've heard of cases where removing fewer than 10 repetitive reads allowed a previously failed assembly to run to completion on quite modest hardware.

Just glancing at the archive I see 5815065 traces for C. intestinalis and they seem of "average" length. I'd guess complete assembly of those would require a fast processor (or 2) and probably not less than 4 GB RAM (so 64-bit) to finish in a reasonable time.
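
The back-of-envelope sizing above can be written out. This small sketch only multiplies the trace counts quoted in this thread by the assumed ~1 kilobyte per trace; it covers raw sequence text only and says nothing about the intermediate and output files, which depend on the assembler.

```python
BYTES_PER_TRACE = 1_000  # assumption from the thread: ~1 kb of sequence per trace


def fasta_size_mb(n_traces: int, bytes_per_trace: int = BYTES_PER_TRACE) -> float:
    """Approximate size of the raw FASTA text in megabytes (10**6 bytes)."""
    return n_traces * bytes_per_trace / 1e6


# Trace counts quoted in the comments above.
for label, n_traces in [
    ("microbial genome", 45_000),
    ("10x more traces", 450_000),
    ("C. intestinalis archive", 5_815_065),
]:
    print(f"{label}: {n_traces:,} traces, about {fasta_size_mb(n_traces):,.0f} MB of sequence")
```

Under that assumption the three cases come to about 45 MB, 450 MB and 5.8 GB of sequence text, which suggests why the full C. intestinalis set would not fit in a 32-bit address space.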


Genotrace

I have found Genotrace to be of some service in this regard, depending on the current coverage of the genome.

There is a web server hosted at still deemed 'experimental'. It probably doesn't host your organism of choice but you could try asking the authors to include your organism....???


grrrr....

Gnnaaarr! Playing around with genotrace for days now and can't get it to run... this is the usual story of a bioinformatics package... download it, configure, make -> won't work without fixin'. Will tell you later about the results...


Well it's published in

Well it's published in Bioinformatics. What do you expect ;)

"Fixing" is an essential bioinformatics skill. Try to enjoy your pain.


tracembler

This is more or less a copy of genotrace. Well, yes, they write a rebuttal in their paper, "ours is different because bla", but well, they would have done better to read nodalpoint before starting their project.
http://www.plantgdb.org/tool/tracembler/