Does anyone have a good (free, open source) software solution to reduce the redundancy of a set of sequences (e.g. return a set where no two sequences are more than 90% identical, based on a pairwise alignment)?
I want something like the ExPASy "Decrease redundancy" tool, but capable of handling larger sequence sets (even if the compute time scales poorly due to calculating all-against-all pairwise alignments). Ideally this would run locally on Linux/Unix/whatever. Jalview can reduce redundancy, but it relies on a multiple alignment, which doesn't suit my purpose.
This problem pops up for me occasionally and I've never found a piece of software that does this "out of the box", except the aforementioned web-based tool. I'm about to write my own ... but it really seems like an obvious operation that should already exist as part of some tool out there ... somewhere.
Suggestions?


I second cd-hit
CD-HIT works for me. Simple as "cd-hit -i file -o file90 -c 0.9 -n 5".
Sorry about your comments Pawel, they went to spam for some reason.
Thanks !
Thank you Pawel and Neil.
CD-HIT is exactly what I want, and it works great. The algorithm appears to take some smart short-cuts and reduces calculation time from what would have taken many hours with my half-written brute-force all-against-all pairwise alignment script down to seconds. If CD-HIT is good enough for UniProt, it's good enough for me :)
Blastclust looks like another good option (it was even already installed on my machine ... right under my nose), but it is orders of magnitude slower, so I'm sticking with CD-HIT for my quick-n-dirty tasks.
Lucky I didn't go very far coding something to do this myself (although thinking about the problem was enlightening). I've got to remember to listen to my internal "stop-immediately-you-are-reinventing-the-wheel" alarm more often.
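For anyone curious what the brute-force version of this operation looks like, here is a minimal Python sketch. It is not CD-HIT's implementation and not the script mentioned above; the function names (`lcs_length`, `identity`, `reduce_redundancy`) are made up for illustration, and "identity" is approximated as the longest common subsequence divided by the shorter sequence length, which is only a crude stand-in for a real pairwise alignment score.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b.

    Plain dynamic programming, O(len(a) * len(b)) time.
    """
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, other in enumerate(b, start=1):
            if ch == other:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def identity(a, b):
    """Crude identity estimate (assumption: LCS / shorter length)."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    return lcs_length(a, b) / shorter


def reduce_redundancy(seqs, threshold=0.9):
    """Greedy filter over a {name: sequence} dict.

    Sequences are visited longest first; one is kept only if it is not
    more than `threshold` identical to any sequence already kept.
    Returns the names of the kept sequences.
    """
    kept = []
    ordered = sorted(seqs.items(), key=lambda kv: len(kv[1]), reverse=True)
    for name, seq in ordered:
        if all(identity(seq, rep) <= threshold for _, rep in kept):
            kept.append((name, seq))
    return [name for name, _ in kept]
```

Called as `reduce_redundancy({"a": seq_a, "b": seq_b}, threshold=0.9)`, it returns the names to keep. Every candidate may be compared against every kept sequence, and each comparison is itself quadratic in sequence length, which is why this approach crawls on large sets. As far as I understand it, CD-HIT uses the same greedy, longest-first idea but skips most of the expensive comparisons by first counting shared short words.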
You're welcome
You're welcome and apologies for omitting "-c 0.9" in my first comment. I'm sure you worked it out.
cd-hit, blastclust
I'm not sure if my previous comment was saved... Anyway, I will point again to cd-hit and blastclust. However, the former does not make any alignments, and allows some redundancy in the resulting set anyway.