<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xml:base="http://www.nodalpoint.org" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
 <title>nodalpoint.org - Sequence analysis - Comments</title>
 <link>http://www.nodalpoint.org/bioinformatics/sequence_analysis</link>
 <description>Comments for &quot;Sequence analysis&quot;</description>
 <language>en</language>
<item>
 <title>String-blasting</title>
 <link>http://www.nodalpoint.org/2007/06/17/blast_is_the_same_as_google_but_for_sequences#comment-3678</link>
 <description>&lt;p&gt;It&#039;s useful to apply BLAST-like techniques for searching over strings.  This has been used to find variations of the names of genes in text by several groups.  First, &lt;a href=&quot;http://www.yalepath.org/facultydb/id=KrauthammerM.htm&quot;&gt;Michael Krauthammer&lt;/a&gt;, when he was at Columbia, used base-pairs to encode arbitrary strings (they&#039;re usually encoding amino acids), then queried gene and protein names for matches in the text of journal articles.  &lt;/p&gt;
&lt;p&gt;BLAST is really just edit distance with some exclusion heuristics which don&#039;t work at all well on small strings.  So it&#039;s more natural to implement this notion directly, as &lt;a href=&quot;http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/papers/acl03bio.pdf&quot;&gt;Tsuruoka and Tsujii&lt;/a&gt; did.   There&#039;s a nice description of the algorithms in &lt;a href=&quot;http://wwwcsif.cs.ucdavis.edu/~gusfield/&quot;&gt;Dan Gusfield&lt;/a&gt;&#039;s &lt;a href=&quot;http://www.cambridge.org/uk/catalogue/catalogue.asp?isbn=9780521585194&quot;&gt;string algorithm bible&lt;/a&gt;.  &lt;/p&gt;
&lt;p&gt;Our LingPipe software provides an implementation of approximate dictionary matching following Gusfield.  Here&#039;s a link to the class Javadoc:  &lt;a href=&quot;http://www.alias-i.com/lingpipe/docs/api/com/aliasi/dict/ApproxDictionaryChunker.html&quot;&gt;com.aliasi.dict.ApproxDictionaryChunker&lt;/a&gt;.  We provide Tsuruoka and Tsujii&#039;s distance metric as a constant, but the distances are plug-and-play.&lt;/p&gt;
&lt;p&gt;The really critical issue here is not just finding approximate matches of names of biomedical entities, but also disambiguating them.  The acronym &quot;ACT&quot; means a lot of different things in different contexts.  Figuring out which sense of a word or phrase is intended is a widely studied problem usually going under the heading of word sense disambiguation for common nouns or database linkage for proper nouns. This can either be done via unsupervised clustering, or by supervised database linkage if there are example contexts.  Luckily, databases such as Entrez and KEGG provide GeneRIFs which include pointers to articles about specific genes.  And evaluations like &lt;a href=&quot;http://biocreative.sourceforge.net/&quot;&gt;Biocreative&lt;/a&gt; are evaluating abilities of systems to figure out which gene is being mentioned in an article.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://www.colloquial.com/carp&quot;&gt;Bob Carpenter&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://www.alias-i.com/lingpipe&quot;&gt;Alias-i, Inc.&lt;/a&gt;&lt;/p&gt;
&lt;br class=&quot;clear&quot; /&gt;</description>
 <pubDate>Wed, 20 Jun 2007 13:01:43 -0400</pubDate>
 <dc:creator>Bob Carpenter</dc:creator>
 <guid isPermaLink="false">comment 3678 at http://www.nodalpoint.org</guid>
</item>
<item>
 <title>Flawed, but useful analogy: Bloogle</title>
 <link>http://www.nodalpoint.org/2007/06/17/blast_is_the_same_as_google_but_for_sequences#comment-3674</link>
 <description>&lt;p&gt;I think Google is still a useful analogy for explaining BLAST to wet bench biologists, even if it does have its flaws. As for &quot;Google isn&#039;t statistical&quot;, I disagree. What about all the statistics, probability and machine learning they use to build and improve search results? Google (and other search engines) have very well defined metrics for measuring search quality; it&#039;s not all subjective. So despite its problems, search is still a handy analogy for describing BLAST that many people will be familiar with.&lt;/p&gt;
&lt;br class=&quot;clear&quot; /&gt;</description>
 <pubDate>Mon, 18 Jun 2007 07:29:40 -0400</pubDate>
 <dc:creator>Duncan</dc:creator>
 <guid isPermaLink="false">comment 3674 at http://www.nodalpoint.org</guid>
</item>
<item>
 <title>Analogy</title>
 <link>http://www.nodalpoint.org/2007/06/17/blast_is_the_same_as_google_but_for_sequences#comment-3673</link>
 <description>&lt;p&gt;Or, BLAST is like a microwave oven, except for sequences and not frozen burritos.&lt;/p&gt;
&lt;p&gt;But seriously, I think the Google analogy isn&#039;t very good because&lt;br /&gt;
1) You don&#039;t search using a subject in BLAST, but by using another sequence. If Google worked that way, you&#039;d give it a web page and it would find web pages similar to it.&lt;br /&gt;
2) BLAST is statistical, Google isn&#039;t. The only measure of how good a Google search is is the (subjective) opinion of the searcher.&lt;/p&gt;
&lt;p&gt;If you want an analogy, assuming the students know some bench biology, how about &quot;BLAST is an electronic Southern Blot&quot;?&lt;/p&gt;
&lt;br class=&quot;clear&quot; /&gt;</description>
 <pubDate>Mon, 18 Jun 2007 07:02:50 -0400</pubDate>
 <dc:creator>Jonathan_Badger</dc:creator>
 <guid isPermaLink="false">comment 3673 at http://www.nodalpoint.org</guid>
</item>
<item>
 <title>Solexa</title>
 <link>http://www.nodalpoint.org/2007/01/30/high_throughput_pyro_sequencers_for_the_price_of_a_pcr_machine_in_three_years#comment-3602</link>
 <description>&lt;p&gt;Solexa&#039;s technology produces even shorter reads - approx 30bp or so. It seems to be rapidly gaining in currency and analysis effort, if the Cold Spring Harbor meeting I attended earlier this month is any indication. A number of the large sequencing centers are choosing which way to split, and rather predictably both Solexa and 454 are getting attention. It will be interesting to see whether one emerges as a winner, or as suggested below, the combination will prove more powerful.&lt;/p&gt;
&lt;p&gt;It is certainly becoming clear that these technologies are only likely to be useful in resequencing and EST-like projects where reference genome scaffolds already exist for assembly. I&#039;m sure someone will come up with a clever work-around eventually, though.&lt;/p&gt;
&lt;br class=&quot;clear&quot; /&gt;</description>
 <pubDate>Mon, 21 May 2007 15:38:45 -0400</pubDate>
 <dc:creator>chris</dc:creator>
 <guid isPermaLink="false">comment 3602 at http://www.nodalpoint.org</guid>
</item>
<item>
 <title>More like these to come</title>
 <link>http://www.nodalpoint.org/2006/07/25/a_sanger_pyrosequencing_hybrid_approach_for_the_generation_of_high_quality_draft_assemblies_of_marine_microbial_genom#comment-3593</link>
 <description>&lt;p&gt;I think more and more people will be getting into this strategy at the beginning because 454&#039;s method leaves much to be desired. On the other hand, I think we should look carefully at how ABI is going to handle this new and emerging market. Their new high-throughput sequencer, SOLID, is based on the Agencourt system. &lt;/p&gt;
&lt;p&gt;What would be really cool is if we could get a machine that could interchange modes. That is quite an instrumentation challenge, because one technology is bead-based (454) - not sure about SOLID - and the other is clone-based (Sanger).&lt;/p&gt;
&lt;br class=&quot;clear&quot; /&gt;</description>
 <pubDate>Fri, 11 May 2007 04:40:11 -0400</pubDate>
 <dc:creator>badboyz</dc:creator>
 <guid isPermaLink="false">comment 3593 at http://www.nodalpoint.org</guid>
</item>
<item>
 <title>Bead Events ???</title>
 <link>http://www.nodalpoint.org/2007/01/30/high_throughput_pyro_sequencers_for_the_price_of_a_pcr_machine_in_three_years#comment-3592</link>
 <description>&lt;p&gt;I work with this data every day and let me say that it&#039;s a bit of pain and pleasure. The data management issue is something most of us don&#039;t have a problem with, but the pain point comes with having to continually refine our analysis to handle data inconsistencies.&lt;/p&gt;
&lt;p&gt;On the whole, 454 are making a great effort to deliver clean datasets to their customers. The longer reads do help, although I&#039;m a bit wary of these homopolymers and phase errors.  And then there are these &quot;Bead Events&quot; that I keep on hearing about from various people.  Personally, I don&#039;t think that we should ever expect too much from a new technology that is trying to take massive strides too soon.  I like the idea of combining data sources, e.g. Sanger + 454, Solexa + Sanger + 454, etc.&lt;/p&gt;
&lt;p&gt;Mostafa Ronaghi, the inventor of the first pyrosequencing system, gave quite an informative talk in Malaysia about technologies, the state of the art, cost issues, etc. It can be streamed online from&lt;br /&gt;
&lt;a href=&quot;http://www.mgrc.com.my/eLectureRonaghi.shtml&quot; title=&quot;http://www.mgrc.com.my/eLectureRonaghi.shtml&quot;&gt;http://www.mgrc.com.my/eLectureRonaghi.shtml&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Back to the data issue. Does anybody have more insight into what kinds of inconsistencies or other issues people may need to be wary of?&lt;/p&gt;
&lt;br class=&quot;clear&quot; /&gt;</description>
 <pubDate>Fri, 11 May 2007 04:28:36 -0400</pubDate>
 <dc:creator>badboyz</dc:creator>
 <guid isPermaLink="false">comment 3592 at http://www.nodalpoint.org</guid>
</item>
<item>
 <title>Upgrade to 250bp in the works?</title>
 <link>http://www.nodalpoint.org/2007/01/30/high_throughput_pyro_sequencers_for_the_price_of_a_pcr_machine_in_three_years#comment-3319</link>
 <description>&lt;p&gt;As far as I know, the 454 people promised an &quot;upgrade&quot; of their technology to extend the maximum reads to 250 bp, I believe. Of course, I want to see it before I believe it.&lt;br /&gt;
However, I will be able to confirm in a few months, as the people responsible for the 454 instrument here want to perform that upgrade.&lt;/p&gt;
&lt;br class=&quot;clear&quot; /&gt;</description>
 <pubDate>Thu, 01 Feb 2007 02:42:28 -0500</pubDate>
 <dc:creator>lbbros</dc:creator>
 <guid isPermaLink="false">comment 3319 at http://www.nodalpoint.org</guid>
</item>
<item>
 <title>short reads</title>
 <link>http://www.nodalpoint.org/2007/01/30/high_throughput_pyro_sequencers_for_the_price_of_a_pcr_machine_in_three_years#comment-3317</link>
 <description>&lt;p&gt;short reads are less a problem if one has a close genome at hand that has already been sequenced... the more genomes are available the easier it is to simply assemble by alignment, which is similar to what ensembl is doing now with their 3x genomes (see their 2007 article in NAR).&lt;/p&gt;
&lt;p&gt;but anyways, yes, I was rather thinking about re-sequencing than sequencing from scratch.&lt;/p&gt;
&lt;br class=&quot;clear&quot; /&gt;</description>
 <pubDate>Wed, 31 Jan 2007 05:35:43 -0500</pubDate>
 <dc:creator>maximilianh</dc:creator>
 <guid isPermaLink="false">comment 3317 at http://www.nodalpoint.org</guid>
</item>
<item>
 <title>Short reads</title>
 <link>http://www.nodalpoint.org/2007/01/30/high_throughput_pyro_sequencers_for_the_price_of_a_pcr_machine_in_three_years#comment-3316</link>
 <description>&lt;p&gt;I just came out of a group meeting where 454 was discussed.  Something I hadn&#039;t realised is that the read length is very short - 100 bp or so.  This makes 454 great for resequencing, mapping reads onto existing assemblies and finishing, not so great for assembly of a genome from scratch.  I&#039;m excited by the idea that any lab could do a genome one day, but it won&#039;t be for a while yet.&lt;/p&gt;
&lt;p&gt;I vaguely recall a technology where single DNA strands are fed through nanopores and changes in electrical resistance used to read bases?  If anyone has a better memory than mine, feel free to comment.  That kind of technology strikes me as &quot;the way forward&quot; for truly high-throughput sequencing.&lt;/p&gt;
&lt;br class=&quot;clear&quot; /&gt;</description>
 <pubDate>Tue, 30 Jan 2007 20:19:32 -0500</pubDate>
 <dc:creator>Neil</dc:creator>
 <guid isPermaLink="false">comment 3316 at http://www.nodalpoint.org</guid>
</item>
<item>
 <title>Hackathons</title>
 <link>http://www.nodalpoint.org/2007/01/30/real_world_workshops_regcreative#comment-3315</link>
 <description>&lt;p&gt;This sounds great.  The BioPerl guys used to organise &quot;hackathons&quot;, but I guess they were restricted more to developers.  It would be useful to get developers and end-users together for brainstorming workshops.&lt;/p&gt;
&lt;br class=&quot;clear&quot; /&gt;</description>
 <pubDate>Tue, 30 Jan 2007 20:13:13 -0500</pubDate>
 <dc:creator>Neil</dc:creator>
 <guid isPermaLink="false">comment 3315 at http://www.nodalpoint.org</guid>
</item>
<item>
 <title>We have one of those</title>
 <link>http://www.nodalpoint.org/2007/01/30/high_throughput_pyro_sequencers_for_the_price_of_a_pcr_machine_in_three_years#comment-3314</link>
 <description>&lt;p&gt;I can say that handling the data from that beast is quite a task, according to my co-workers who rebuild and analyze the instrument&#039;s reads to get clean (and possibly gap-free) sequences.&lt;/p&gt;
&lt;p&gt;In my opinion, pyrosequencing such as what the 454 system does will not be a viable option for many institutes until this large data-handling problem is tackled. At first we had just one person working on the data, but it became evident that it wasn&#039;t enough. Now we have three people dedicated to the data analysis, on different projects.&lt;/p&gt;
&lt;br class=&quot;clear&quot; /&gt;</description>
 <pubDate>Tue, 30 Jan 2007 18:16:15 -0500</pubDate>
 <dc:creator>lbbros</dc:creator>
 <guid isPermaLink="false">comment 3314 at http://www.nodalpoint.org</guid>
</item>
<item>
 <title>phylogeny programs</title>
 <link>http://www.nodalpoint.org/2006/04/15/algorithms_algorithms_everywhere_0#comment-3165</link>
 <description>&lt;p&gt;I should have added Joe Felsenstein&#039;s list of phylogeny programs to that post. Quote: &lt;a href=&quot;http://evolution.genetics.washington.edu/phylip/software.html&quot;&gt;&quot;Here are 267 of the phylogeny packages, and 32 free servers, that I know about.&quot;&lt;/a&gt; I admit that my example, motif discovery, is nothing compared with older disciplines like that...&lt;/p&gt;
&lt;br class=&quot;clear&quot; /&gt;</description>
 <pubDate>Wed, 11 Oct 2006 11:53:13 -0400</pubDate>
 <dc:creator>maximilianh</dc:creator>
 <guid isPermaLink="false">comment 3165 at http://www.nodalpoint.org</guid>
</item>
<item>
 <title>BLAT and PatternHunter</title>
 <link>http://www.nodalpoint.org/2006/09/29/10_000_times_faster_blast_in_java#comment-3164</link>
 <description>&lt;p&gt;thanks for the reference to PatternHunter... I&#039;ve heard about it before but forgot it here. &lt;/p&gt;
&lt;p&gt;Yes, it seems they simply put the database into memory. But well, given today&#039;s RAM prices that&#039;s not such a bad idea. BLAT does the same, with less sensitivity. And buying RAM is cheaper than buying some strange special hardware.&lt;/p&gt;
&lt;p&gt;Neil: a) PatternHunter is also &lt;a href=&quot;http://www.bioinformaticssolutions.com/products/ph/&quot;&gt;commercial&lt;/a&gt; though free for academic use.&lt;br /&gt;
b) You don&#039;t want to use BLAT in many applications as it is a lot less sensitive than BLAST. I would rely on BLAT only if the sequence is from the same or a very, very close species (like chimp or mouse to human) but that might also depend on the gene... So BLAT is really no alternative. Besides, BLAT is only free for academic use.&lt;/p&gt;
&lt;br class=&quot;clear&quot; /&gt;</description>
 <pubDate>Wed, 11 Oct 2006 11:27:17 -0400</pubDate>
 <dc:creator>maximilianh</dc:creator>
 <guid isPermaLink="false">comment 3164 at http://www.nodalpoint.org</guid>
</item>
<item>
 <title>How can we know?</title>
 <link>http://www.nodalpoint.org/2006/09/29/10_000_times_faster_blast_in_java#comment-3163</link>
 <description>&lt;p&gt;10 000 x does seem like an outrageous claim.  How are we to know?  The website is high on &quot;corporate speak&quot; and low on technical information.  &quot;More options than all other products combined&quot;, &quot;a new paradigm in biotechnology&quot;, &quot;the ultimate genomics search solution&quot; &lt;i&gt;etc&lt;/i&gt;.  The only way to test their claims is to obtain the product.  And that, my friends, is why open source is superior.&lt;br /&gt;
Itatsumaki has already noted the huge RAM dependency - 4 GB is hardly standard even on a modern laptop.  If you want a fast, in-memory BLAST alternative, why not try &lt;a href=&quot;http://www.soe.ucsc.edu/~kent/src/&quot;&gt;BLAT&lt;/a&gt;?&lt;/p&gt;
&lt;br class=&quot;clear&quot; /&gt;</description>
 <pubDate>Wed, 11 Oct 2006 00:34:51 -0400</pubDate>
 <dc:creator>Neil</dc:creator>
 <guid isPermaLink="false">comment 3163 at http://www.nodalpoint.org</guid>
</item>
<item>
 <title>RAM Dependent</title>
 <link>http://www.nodalpoint.org/2006/09/29/10_000_times_faster_blast_in_java#comment-3162</link>
 <description>&lt;p&gt;I looked at this, and it seems like they&#039;re just playing tricks.  Check out the RAM dependency they quote, it&#039;s huge.  They suggest your required RAM is 24 times the size of the database in letters, divided by the word-size.  So, for example if you try this algorithm against a large database like nt (non-redundant nucleotides) you would need a minimum of: 24 x 928,057,554 / 12 = 1.7 GB.  So it seems like they are making a speed/space trade-off somewhere.  Interestingly, that factor of 10k they quote sounds suspiciously like the difference between HD &amp;amp; memory access.&lt;/p&gt;
&lt;p&gt;It also seems likely that they played around with the seeding.  I think it&#039;s wrong to say that &quot;local alignments have been worked on for decades without a lot of speed improvement&quot;.  I think PatternHunter&#039;s asymmetric seeds are pretty convincing at doing just that, although I&#039;ve never benchmarked it myself.  Check out:&lt;br /&gt;
Bioinformatics. 2002 Mar;18(3):440-5.&lt;br /&gt;
PatternHunter: faster and more sensitive homology search.&lt;br /&gt;
Ma B, Tromp J, Li M.&lt;/p&gt;
&lt;br class=&quot;clear&quot; /&gt;</description>
 <pubDate>Wed, 11 Oct 2006 00:03:26 -0400</pubDate>
 <dc:creator>Itatsumaki</dc:creator>
 <guid isPermaLink="false">comment 3162 at http://www.nodalpoint.org</guid>
</item>
</channel>
</rss>
