<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xml:base="http://www.nodalpoint.org" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
 <title>nodalpoint.org - genbank - Comments</title>
 <link>http://www.nodalpoint.org/nodalpoint_tags/genbank</link>
 <description>Comments for &quot;genbank&quot;</description>
 <language>en</language>
<item>
 <title>some points</title>
 <link>http://www.nodalpoint.org/2007/08/06/roundup_extract_a_sequence_from_a_fasta_file#comment-4224</link>
 <description>&lt;p&gt;Thanks Neil. I think there is no limit on the number of sequences you can download, but there is a limit on the number of keywords you can search for. I searched for &quot;id1 or id2 or id3 ...&quot; and hit the limit there. There is probably a better way to interface with NCBI, but I don&#039;t know it.&lt;/p&gt;
&lt;p&gt;Why 1 GB? Most EST projects don&#039;t give their sequences GenBank-compatible IDs. They start with something homegrown, of course, then put the clones onto plates and distribute them. This is similar for mouse clones, zebrafish clones and, in my case, ascidian cDNA clones. So a word in the description can be a real ID if a sequence has historically always had two different IDs. My EST sequences were just selected sequences, so there is no way to filter on a feature of the sequence itself.&lt;/p&gt;
&lt;p&gt;I think badboyz is right that very often plain old Perl is more than enough. With hindsight, I could have hacked a Python/Perl script together in 20 minutes. It&#039;s so simple: slurp the IDs into a hash, iterate over the lines of the FASTA file and output every record whose FASTA header contains a word in the hash. My problem was that I assumed there was already a solution for this.&lt;/p&gt;
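&lt;p&gt;For illustration, the slurp-and-filter idea could look roughly like this in Python. It is a minimal sketch, not a tested tool: the function names, the IDs and the sequences are all made up, and it assumes that IDs appear in the header as whole whitespace-separated words.&lt;/p&gt;
&lt;pre&gt;
```python
import io


def load_ids(handle):
    """Read one wanted id per line into a set (the 'hash' in the text above)."""
    return {line.strip() for line in handle if line.strip()}


def filter_fasta(fasta, wanted, out):
    """Copy every record whose header contains a wanted word; return the count."""
    keep = False
    kept = 0
    for line in fasta:
        if line.startswith(">"):
            # Split the header into whitespace-separated words, so an id that
            # sits in the free-form description is matched as well.
            words = line[1:].split()
            keep = any(word in wanted for word in words)
            kept += keep
        if keep:
            out.write(line)
    return kept


# Tiny in-memory demonstration with hypothetical ids and sequences.
ids = load_ids(io.StringIO("cloneA12\nEST0003\n"))
fasta = io.StringIO(
    ">EST0001 plate 7 cloneA12\nACGT\nGGCC\n"
    ">EST0002 plate 7 cloneB01\nTTTT\n"
    ">EST0003 no description\nCCCC\n"
)
result = io.StringIO()
n_kept = filter_fasta(fasta, ids, result)
print(result.getvalue(), end="")
```
&lt;/pre&gt;
&lt;p&gt;The file is read line by line in a single pass, so memory use depends on the size of the ID list and not on the size of the FASTA file.&lt;/p&gt;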
&lt;p&gt;This happens to me all the time: I find programs on the internet that claim to do something and end up hacking something together 3 days later because the tools a) don&#039;t come with source code, b) won&#039;t run at all, or c) do something slightly different, which compromises the whole application I need them for.&lt;/p&gt;
&lt;br class=&quot;clear&quot; /&gt;</description>
 <pubDate>Thu, 11 Oct 2007 13:01:13 -0400</pubDate>
 <dc:creator>maximilianh</dc:creator>
 <guid isPermaLink="false">comment 4224 at http://www.nodalpoint.org</guid>
</item>
<item>
 <title>A couple of points</title>
 <link>http://www.nodalpoint.org/2007/08/06/roundup_extract_a_sequence_from_a_fasta_file#comment-4197</link>
 <description>&lt;p&gt;First, I don&#039;t believe there is a limit to the number of sequences that NCBI Entrez will export.  I&#039;ve retrieved tens of thousands in the past.  So far as I remember, it&#039;s a case of choosing &quot;save to file&quot; from a set of search results.&lt;/p&gt;
&lt;p&gt;Second, sequence retrieval is certainly harder than it should be.  But you need to ask yourself if your initial approach is a good one.  Do you really need to start with a 1 GB file or can you cut it down using some pre-filter?  And sequence retrieval using descriptions is never a good idea.  The FASTA header contains two components:  an ID (anything between the &amp;gt; and the first whitespace) and a description (the rest).  The description is freeform.  It can be whatever people want, or nothing at all, or a totally misleading annotation.  This is why we have unique identifiers.  Can you use some feature of the sequence itself for retrieval?  Length, pI, presence of a domain detected using HMMER versus Pfam?  BioPerl is very good at that kind of search and retrieval.&lt;/p&gt;
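&lt;p&gt;As a small illustration of the two header components, here is a sketch in Python. The function name split_header and the example headers are hypothetical, and it is not how BioPerl does it.&lt;/p&gt;
&lt;pre&gt;
```python
def split_header(header):
    """Split a FASTA header line into (id, description)."""
    if not header.startswith(">"):
        raise ValueError("not a FASTA header: %r" % header)
    # split(None, 1) splits once, on the first run of whitespace.
    parts = header[1:].rstrip("\n").split(None, 1)
    seq_id = parts[0] if parts else ""
    description = parts[1].strip() if len(parts) > 1 else ""
    return seq_id, description


# Hypothetical headers: one with a description, one without.
print(split_header(">seq1 hypothetical protein\n"))
print(split_header(">seq2"))
```
&lt;/pre&gt;
&lt;p&gt;Only the first element is suitable as a lookup key. The second is whatever the submitter typed.&lt;/p&gt;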
&lt;br class=&quot;clear&quot; /&gt;</description>
 <pubDate>Mon, 10 Sep 2007 08:23:59 -0400</pubDate>
 <dc:creator>Neil</dc:creator>
 <guid isPermaLink="false">comment 4197 at http://www.nodalpoint.org</guid>
</item>
<item>
 <title>simple solution</title>
 <link>http://www.nodalpoint.org/2007/08/06/roundup_extract_a_sequence_from_a_fasta_file#comment-4196</link>
 <description>&lt;p&gt;Let&#039;s get back to basics on this question. From what I&#039;ve read above, it seems we&#039;re not quite there yet when it comes to doing simple tasks on very large datasets. The bulk of what we use (BioPerl, EMBOSS, etc.) works great with tens or hundreds of thousands of sequences at most (depending on sequence length, of course).&lt;/p&gt;
&lt;p&gt;If you&#039;re trying to find sequences based on matching headers, then a very simple Perl script should do the trick. No BioPerl is needed, and perhaps a pipe to or from grep will get it done. Writing something that runs through the input file no more than once will also help. Something I find quite useful in gawk and Perl is the ability to set the input record separator (&quot;$/&quot; in Perl, RS in gawk) to something logical like &quot;//&quot; for GenBank entries.&lt;/p&gt;
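&lt;p&gt;The record-separator idea can be sketched in Python too, which has no direct equivalent of &quot;$/&quot;. The generator below is an assumption-laden stand-in: read_records is a made-up name, the two entries are fabricated placeholders, and a record ends at a line consisting only of &quot;//&quot;, which is stricter than splitting on the bare string.&lt;/p&gt;
&lt;pre&gt;
```python
import io


def read_records(handle, terminator="//"):
    """Yield one record at a time, each ending at a terminator line."""
    lines = []
    for line in handle:
        lines.append(line)
        if line.rstrip() == terminator:
            yield "".join(lines)
            lines = []
    # Emit a trailing record that lacks a terminator, if it has any content.
    if any(text.strip() for text in lines):
        yield "".join(lines)


# Hypothetical two-entry flat file, reduced to the bare minimum.
flat_file = io.StringIO(
    "LOCUS       AB000001\nDEFINITION  first entry\n//\n"
    "LOCUS       AB000002\nDEFINITION  second entry\n//\n"
)
wanted = {"AB000002"}
# The second whitespace-separated token of a record is the locus name.
matches = [rec for rec in read_records(flat_file) if rec.split()[1] in wanted]
print("".join(matches), end="")
```
&lt;/pre&gt;
&lt;p&gt;Like the one-pass advice above, this holds only one record in memory at a time.&lt;/p&gt;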
&lt;br class=&quot;clear&quot; /&gt;</description>
 <pubDate>Mon, 10 Sep 2007 04:26:21 -0400</pubDate>
 <dc:creator>badboyz</dc:creator>
 <guid isPermaLink="false">comment 4196 at http://www.nodalpoint.org</guid>
</item>
</channel>
</rss>
