pansapiens's blog

Clarifying simple things that end up complicated: Percentage Identity

A new paper by Raghava and Barton, "Quantification of the variation in percentage identity for protein sequence alignments", has just gone online at BMC Bioinformatics.

Initially I was shocked: how, in 2006, could anyone manage to publish anything original about percentage identity (PID), that simple but oft used and abused measure that is fundamental to the definition of the "twilight zone" of sequence similarity (for inferring structural similarity or relatedness from sequence alone)?

Well, it turns out (and it becomes obvious when you try to code it) that there is more than one way to calculate the PID of a multiple sequence alignment, and each method yields different results. Authors rarely state exactly which method they used and, not surprisingly, no matter how you choose to measure the PID, the multiple alignment algorithm used also has a substantial impact.
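To make the ambiguity concrete, here is a minimal Python sketch that scores one pair of aligned sequences using four different denominators. The function name, the denominator labels and the example sequences are my own illustration; they are plausible choices, not necessarily the exact definitions compared in the paper.

```python
def percent_identity(a: str, b: str, gap: str = "-") -> dict[str, float]:
    """PID of two aligned sequences under several (assumed) denominators."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")

    # Ignore columns where both rows are gaps; these show up when a pair
    # of rows is pulled out of a larger multiple alignment.
    cols = [(x, y) for x, y in zip(a, b) if not (x == gap and y == gap)]

    identical = sum(1 for x, y in cols if x == y and x != gap)
    aligned = sum(1 for x, y in cols if x != gap and y != gap)
    len_a = sum(1 for x, _ in cols if x != gap)
    len_b = sum(1 for _, y in cols if y != gap)

    denominators = {
        "alignment_columns": len(cols),        # every column, gaps included
        "aligned_pairs": aligned,              # residue-residue columns only
        "shorter_sequence": min(len_a, len_b), # ungapped length of shorter sequence
        "mean_length": (len_a + len_b) / 2,    # mean ungapped length
    }
    return {
        name: 100.0 * identical / d if d else 0.0
        for name, d in denominators.items()
    }


if __name__ == "__main__":
    # Made-up sequences: 7 identical columns, 1 mismatch, 2 gapped columns.
    for name, pid in percent_identity("ACDEFG-HIK", "ACD-FGWHLK").items():
        print(f"{name:18s} {pid:5.1f}%")
```

For this toy pair the same alignment scores 70% when divided by all alignment columns and 87.5% when divided by aligned residue pairs only, which is exactly the sort of gap that goes unreported when the method is not stated.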


Content Creation and Text processing

Liam Quin from W3C has given a few useful tips on processing documents (e.g. error-prone re-typed or scanned text) into XML.

Many of these practices are important for the sort of text processing tasks that seem to come up in bioinformatics.

Article summary: use lots of small one-off scripts to make small changes, continually validate your output, briefly document your steps, automate the steps with a meta-script or Makefile, and keep input and output text separate (well, duh!).
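As a rough sketch of that workflow in Python (not taken from the article; the step functions and file layout here are hypothetical), each fix is a tiny function, the driver plays the role of the meta-script, the result is checked for XML well-formedness on every run, and output is written to a different directory from the input:

```python
import re
import xml.etree.ElementTree as ET
from pathlib import Path


def escape_bare_ampersands(text: str) -> str:
    """Escape '&' characters that do not already start an entity."""
    return re.sub(r"&(?!#?\w+;)", "&amp;", text)


def collapse_spaces(text: str) -> str:
    """Collapse runs of spaces and tabs into a single space."""
    return re.sub(r"[ \t]+", " ", text)


# The ordered list of small steps; add new one-off fixes here.
STEPS = [escape_bare_ampersands, collapse_spaces]


def run_pipeline(text: str, steps=STEPS) -> tuple[str, list[str]]:
    """Apply each step in order, then check the result is well-formed XML."""
    log = []
    for step in steps:
        text = step(text)
        log.append(step.__name__)  # brief record of what was done
    ET.fromstring(text)  # raises ET.ParseError if not well-formed
    return text, log


def process_file(src: Path, out_dir: Path) -> Path:
    """Clean one file, writing the result and a step log to out_dir."""
    if out_dir.resolve() == src.parent.resolve():
        raise ValueError("write output to a separate directory from the input")
    cleaned, log = run_pipeline(src.read_text(encoding="utf-8"))
    out_dir.mkdir(parents=True, exist_ok=True)
    dst = out_dir / src.name
    dst.write_text(cleaned, encoding="utf-8")
    log_path = out_dir / (src.name + ".steps.txt")
    log_path.write_text("\n".join(log) + "\n", encoding="utf-8")
    return dst
```

Re-running the whole pipeline from the untouched input each time you add a step means the validation runs continually and the original text is never overwritten.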

