<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xml:base="http://www.nodalpoint.org" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
 <title>nodalpoint.org - Roundup: Extract a sequence from a fasta file - Comments</title>
 <link>http://www.nodalpoint.org/2007/08/06/roundup_extract_a_sequence_from_a_fasta_file</link>
 <description>Comments for &quot;Roundup: Extract a sequence from a fasta file&quot;</description>
 <language>en</language>
<item>
 <title>some points</title>
 <link>http://www.nodalpoint.org/2007/08/06/roundup_extract_a_sequence_from_a_fasta_file#comment-4224</link>
 <description>&lt;p&gt;Thanks Neil. I think there is no limit on the number of sequences you can download, but there is a limit on the number of keywords you can search for. I searched for &quot;id1 or id2 or id3 ...&quot; and hit the limit there. There is probably a better way to interface with NCBI, but I don&#039;t know it.&lt;/p&gt;
&lt;p&gt;Why 1 GB? Most EST projects don&#039;t give their sequences Genbank-compatible IDs. They start with something homegrown, of course, then put the clones onto plates and distribute them. This is similar for mouse clones, zebrafish clones and also, in my case, ascidian cDNA clones. You see, a word in the description can be a real ID if a sequence has historically always had 2 different IDs. My EST sequences were just selected sequences, so there is no way to filter on the sequence itself.&lt;/p&gt;
&lt;p&gt;I think badboyz is right that very often plain old Perl is more than enough. With hindsight, I could have hacked a Python/Perl script together in 20 minutes. It&#039;s so simple: slurp the IDs into a hash, iterate over the lines of the fasta file and output everything whose fasta header contains a word that is in the hash. My problem was that I assumed that there was already a solution for this.&lt;/p&gt;
&lt;p&gt;This happens to me all the time: I find programs on the internet that claim to do something and end up hacking something together 3 days later because the tools a) don&#039;t come with source code, b) won&#039;t run at all, or c) do something slightly different, which compromises the whole application I need them for.&lt;/p&gt;
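&lt;p&gt;The hash-based filter described above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the function name filter_fasta and the example records are made up for illustration, and &quot;contains a word&quot; is taken to mean an exact match against one whitespace-separated word of the header.&lt;/p&gt;

```python
import io


def filter_fasta(fasta_lines, wanted_ids):
    """Yield the lines of every FASTA record whose header contains a wanted word.

    Assumption: a "word" is any whitespace-separated token of the header
    line (with the leading ">" removed) and must match a wanted ID exactly.
    """
    wanted = set(wanted_ids)  # the "hash" of IDs; membership tests are O(1)
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # New record: decide once whether to keep it.
            words = line[1:].split()
            keep = any(word in wanted for word in words)
        if keep:
            yield line


# Made-up example data; a real run would pass an open file instead.
example = io.StringIO(
    ">gi|1 cloneA first sequence\n"
    "ACGT\n"
    ">gi|2 cloneB second sequence\n"
    "GGCC\n"
    "TTAA\n"
)
selected = list(filter_fasta(example, ["cloneB"]))
print("".join(selected), end="")
```

&lt;p&gt;The file is read only once and line by line, so memory use depends on the number of IDs, not on the size of the fasta file.&lt;/p&gt;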
&lt;br class=&quot;clear&quot; /&gt;</description>
 <pubDate>Thu, 11 Oct 2007 13:01:13 -0400</pubDate>
 <dc:creator>maximilianh</dc:creator>
 <guid isPermaLink="false">comment 4224 at http://www.nodalpoint.org</guid>
</item>
<item>
 <title>A couple of points</title>
 <link>http://www.nodalpoint.org/2007/08/06/roundup_extract_a_sequence_from_a_fasta_file#comment-4197</link>
 <description>&lt;p&gt;First, I don&#039;t believe there is a limit to the number of sequences that NCBI Entrez will export.  I&#039;ve retrieved tens of thousands in the past.  So far as I remember, it&#039;s a case of choosing &quot;save to file&quot; from a set of search results.&lt;/p&gt;
&lt;p&gt;Second, sequence retrieval is certainly harder than it should be.  But you need to ask yourself if your initial approach is a good one.  Do you really need to start with a 1 GB file or can you cut it down using some pre-filter?  And sequence retrieval using descriptions is never a good idea.  The fasta header contains 2 components:  an ID (anything between the &amp;gt; and the first white space) and a description (the rest).  The description is freeform.  It can be whatever people want, or nothing at all, or a totally misleading annotation.  This is why we have unique identifiers.  Can you use some feature of the sequence itself for retrieval?  Length, pI, presence of a domain detected using HMMER versus Pfam?  Bioperl is very good at that kind of search and retrieval.&lt;/p&gt;
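&lt;p&gt;The split of a fasta header into ID and description can be sketched as follows. This is a minimal Python illustration, not a full parser: the function name split_header and the example header are made up, and a header without any description is assumed to yield an empty string.&lt;/p&gt;

```python
def split_header(header_line):
    """Split a FASTA header line into (ID, description).

    The ID is everything between ">" and the first whitespace;
    the description is whatever follows (possibly nothing).
    """
    text = header_line.rstrip("\n")
    if not text.startswith(">"):
        raise ValueError("not a FASTA header line")
    # Split on the first run of whitespace only.
    parts = text[1:].split(None, 1)
    seq_id = parts[0] if parts else ""
    description = parts[1] if len(parts) == 2 else ""
    return seq_id, description


# Made-up header for illustration.
seq_id, description = split_header(">seq1 hypothetical protein, partial\n")
print(seq_id)
print(description)
```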
&lt;br class=&quot;clear&quot; /&gt;</description>
 <pubDate>Mon, 10 Sep 2007 08:23:59 -0400</pubDate>
 <dc:creator>Neil</dc:creator>
 <guid isPermaLink="false">comment 4197 at http://www.nodalpoint.org</guid>
</item>
<item>
 <title>simple solution</title>
 <link>http://www.nodalpoint.org/2007/08/06/roundup_extract_a_sequence_from_a_fasta_file#comment-4196</link>
 <description>&lt;p&gt;Let&#039;s get back to basics on this question. From what I&#039;ve read above, it seems we&#039;re not quite there yet when it comes to doing simple tasks on very large datasets. The bulk of what we use with bioperl, EMBOSS, etc. works great with tens or hundreds of thousands of sequences at most (depending on their sequence length, of course).&lt;/p&gt;
&lt;p&gt;If you&#039;re trying to find sequences based on matching headers, then a very simple perl script should do the trick. No bioperl, and perhaps a pipe or two to/from grep, should get it done. Writing something that runs through the input file no more than once will also help. Something I find quite useful in gawk and perl is the ability to specify the input record separator (&quot;$/&quot; in perl, RS in gawk) and set it to something logical like &quot;//&quot; for Genbank entries.&lt;/p&gt;
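&lt;p&gt;Python has no direct counterpart to the input record separator, but the idea can be approximated with a small generator. The sketch below is an assumption-laden illustration: the name read_records is made up, the separator is taken to be a &quot;//&quot; at the end of a line, and unlike perl the separator is dropped from each record.&lt;/p&gt;

```python
import io


def read_records(handle, separator="//\n", chunk_size=65536):
    """Yield records from a text handle, split on a separator string.

    The input is read once, in chunks, so only the current record
    (plus one chunk) has to fit in memory.
    """
    buffer = ""
    while True:
        chunk = handle.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last separator is a complete record;
        # the remainder stays in the buffer until more data arrives.
        *complete, buffer = buffer.split(separator)
        for record in complete:
            yield record
    if buffer.strip():
        # Trailing data without a final separator.
        yield buffer


# Made-up, heavily shortened Genbank-like entries.
data = io.StringIO("LOCUS A\nORIGIN\n//\nLOCUS B\nORIGIN\n//\n")
records = list(read_records(data, chunk_size=8))
print(len(records))
```

&lt;p&gt;The small chunk_size in the example is only there to exercise the case where a separator is split across two reads.&lt;/p&gt;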
&lt;br class=&quot;clear&quot; /&gt;</description>
 <pubDate>Mon, 10 Sep 2007 04:26:21 -0400</pubDate>
 <dc:creator>badboyz</dc:creator>
 <guid isPermaLink="false">comment 4196 at http://www.nodalpoint.org</guid>
</item>
<item>
 <title>Roundup: Extract a sequence from a fasta file</title>
 <link>http://www.nodalpoint.org/2007/08/06/roundup_extract_a_sequence_from_a_fasta_file</link>
 <description>&lt;p&gt;HMMs, SVMs, MCMC - interesting topics! But I have simple problems. Here is one: How do I extract some sequences from a fasta file if my accession numbers are not Genbank ids themselves but other words that are still in the header?&lt;/p&gt;
&lt;p&gt;&lt;i&gt;Materials:&lt;/i&gt;&lt;br /&gt;
Hm. Pubmed won&#039;t export so many sequences from the web interface (at least I could not find a way; the limit is 100). If I were a biologist, I would probably repeat the process manually 70 times to get 70 * 100 sequences, which might have actually saved me a lot of time. But I wanted to be clever.&lt;/p&gt;
&lt;br class=&quot;clear&quot; /&gt;&lt;p&gt;&lt;a href=&quot;http://www.nodalpoint.org/2007/08/06/roundup_extract_a_sequence_from_a_fasta_file&quot;&gt;read more&lt;/a&gt;&lt;/p&gt;</description>
 <comments>http://www.nodalpoint.org/2007/08/06/roundup_extract_a_sequence_from_a_fasta_file#comments</comments>
 <category domain="http://www.nodalpoint.org/master_list/bioinformatics">Bioinformatics</category>
 <category domain="http://www.nodalpoint.org/nodalpoint_tags/command_line">command line</category>
 <category domain="http://www.nodalpoint.org/computer_science/data_management">Data management</category>
 <category domain="http://www.nodalpoint.org/nodalpoint_tags/defline">defline</category>
 <category domain="http://www.nodalpoint.org/nodalpoint_tags/fasta">fasta</category>
 <category domain="http://www.nodalpoint.org/nodalpoint_tags/genbank">genbank</category>
 <category domain="http://www.nodalpoint.org/nodalpoint_tags/header">header</category>
 <category domain="http://www.nodalpoint.org/nodalpoint_tags/searching">searching</category>
 <pubDate>Mon, 06 Aug 2007 06:07:11 -0400</pubDate>
 <dc:creator>maximilianh</dc:creator>
 <guid isPermaLink="false">2272 at http://www.nodalpoint.org</guid>
</item>
</channel>
</rss>
