Phylogenetics

If you had 100,000 DNA sequences, each 1000 nucleotides long, and you wanted to cluster them or create a phylogeny, which software would you use?

Also, what if they were amino acid sequences, rather than nucleotides?



quicktree

I'd use MAFFT for aligning something this big.
quicktree was designed for building trees from large datasets like this (Pfam families).


MAFFT

Yep, MAFFT is fast and has excellent alignment quality. A paper of mine from last year showed that it is the best for protein alignment, even for distant sequences.


Phylip

Apart from CD-HIT, which is very good software, I would try Phylip. The sequences are not long; the sheer number of them is the problem. Try a neighbor-joining approach in Phylip: it won't be blazing fast, but it will do the job eventually.

I used Phylip to build an NJ tree from a set of 20,000 protein sequences, and it took around 3-4 weeks on a 3 GHz Xeon machine.
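
To put the scale in perspective, here is a rough back-of-the-envelope sketch in Python. It rests on assumptions that are not from the posts above: that distances are stored as one 8-byte float per pair in a triangular matrix, and that run time grows with the cube of the number of sequences, as it does for textbook neighbor joining. The function names are made up for illustration, and real programs may do better or worse.

```python
def pairwise_count(n: int) -> int:
    """Number of unordered sequence pairs, i.e. entries in a triangular distance matrix."""
    return n * (n - 1) // 2


def matrix_bytes(n: int, bytes_per_entry: int = 8) -> int:
    """Memory for the triangular matrix, ASSUMING one 8-byte float per distance."""
    return pairwise_count(n) * bytes_per_entry


def cubic_scale_factor(n_old: int, n_new: int) -> float:
    """Relative run time, ASSUMING cost grows as n**3 (textbook neighbor joining)."""
    return (n_new / n_old) ** 3


if __name__ == "__main__":
    n = 100_000
    print(f"pairs: {pairwise_count(n):,}")
    print(f"matrix: {matrix_bytes(n) / 1e9:.1f} GB")
    print(f"slowdown vs 20,000 sequences: {cubic_scale_factor(20_000, n):.0f}x")
```

Under those assumptions, 100,000 sequences means about 5 billion pairwise distances (roughly 40 GB), and a cubic-time method would need on the order of 125 times longer than it does for 20,000 sequences.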


cd-hit

For clustering, CD-HIT is excellent. Very fast, handles many sequences. Used to create the non-redundant datasets in UniProt and at the PDB.
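
For anyone wondering why this kind of clustering can be fast: the general idea is greedy incremental clustering, where sequences are processed longest first and each one is compared only against the cluster representatives found so far, not against every other sequence. The toy sketch below illustrates that idea only. It is not CD-HIT's code; the `identity` function is a deliberately naive stand-in (position-by-position matching with no alignment), and the names and the 0.9 threshold are assumptions chosen for the example.

```python
from __future__ import annotations


def identity(a: str, b: str) -> float:
    """Toy identity: fraction of matching positions over the shorter sequence.
    ASSUMPTION: no alignment is performed; real tools align the sequences first."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def greedy_cluster(seqs: dict[str, str], threshold: float = 0.9) -> dict[str, list[str]]:
    """Greedy incremental clustering (illustrative sketch).

    Each sequence joins the first representative it matches at >= threshold;
    otherwise it becomes the representative of a new cluster.
    """
    clusters: dict[str, list[str]] = {}
    # Longest first, so every representative is at least as long as its members.
    order = sorted(seqs, key=lambda name: len(seqs[name]), reverse=True)
    for name in order:
        for rep in clusters:
            if identity(seqs[rep], seqs[name]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters[name] = [name]
    return clusters


if __name__ == "__main__":
    example = {
        "a": "ACGTACGTAC",
        "b": "ACGTACGTAA",
        "c": "TTTTGGGGCC",
        "d": "ACGTACGT",
    }
    print(greedy_cluster(example))
```

With the four example sequences, `a`, `b` and `d` end up in one cluster represented by `a`, and `c` forms its own cluster. The number of comparisons depends on how many clusters there are rather than on the square of the number of sequences, which is what makes the approach attractive for very large sets.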

Phylogeny - I've never gone much beyond Clustal and Phylip, both of which would take hours on an average machine with any more than a few thousand sequences. I've heard good things about MrBayes - which is MPI-enabled, so could run on a cluster if you have access to one.


Huge!

Alf, that is a huge requirement! Are you trying to build a tree of life from the 1000 bases upstream of some housekeeping genes?
I think MUSCLE [ http://www.drive5.com/muscle/ ] can come to the rescue, but you would certainly need a good machine. It uses the log-expectation score as its profile function, which is fast as well as accurate [ http://www.biomedcentral.com/1471-2105/5/113/table/T2 ]. The general algorithm is shown at http://nar.oxfordjournals.org/content/vol32/issue5/images/large/gkh340f2... .
More details are in the paper: http://nar.oxfordjournals.org/cgi/content/full/32/5/1792 .

______________________"The Answer Lies in Genome"______________________
http://computationalbiologynews.blogspot.com/