Massive PubMed retrievals

How would one collect all hits for a broad PubMed query (e.g. "mice"[MeSH Terms], which returns 664818 records)? This would be useful for building text-mining sets, for example. I believe NCBI will provide a hard copy of the entire database on request under a licence agreement, but that would date quickly. Any ideas for doing this interactively? The format would preferably be XML (failing that, MEDLINE or BibTeX).


Comments

re: that would date quickly?

I mirror PubMed data locally from their FTP server and have no problems keeping it up to date. They provide a main section that contains the full release for the current year and an update section that contains updates since the full release. All files are XML, and the update files are provided daily, so keeping the mirror current isn't difficult. The problem is that you then have to implement your own interface for searching the XML files. The other issue, which may or may not matter to you, is disk space: Medline is taking up ~50 GB right now.


Careful - NCBI limits on eutils

NCBI have pretty strict rules on robots using eutils.
http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html#UserSystemRequirements
They WILL block you if you don't follow them. My wrappers for eutils all include a 'sleep(3)' so that I don't break their rules.
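For what it's worth, here is a minimal Python sketch of that kind of wrapper. The `Throttle` class is a made-up helper, not part of any NCBI or EUtils library, and the 3-second default simply mirrors the 'sleep(3)' above; check NCBI's current requirements for the interval they actually ask for.

```python
import time


class Throttle:
    """Enforce a minimum delay between successive requests (hypothetical helper)."""

    def __init__(self, min_interval=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds; 3.0 is taken from the comment above
        self._clock = clock               # injectable so the class can be tested offline
        self._sleep = sleep
        self._last = None                 # time of the previous call, if any

    def wait(self):
        """Block until at least min_interval seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Call `throttle.wait()` immediately before every eutils request; the first call returns straight away and later calls sleep only for whatever part of the interval is still outstanding.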


ESearch and EFetch

http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term="mice"[MeSH]&retmax=700000&usehistory=y

Grab the webenv and query_key values from the result, then:
http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&webenv=Y...

Be prepared for a big download.
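A rough Python sketch of the same two steps, in case it helps. It only builds the URLs and parses the ESearch reply, so the actual download is left to you. The function names are my own, the base URL is the one used in this thread (it may have moved since), and the `WebEnv`/`query_key` parameter spellings follow NCBI's documentation. The sample XML at the bottom is a hand-written stand-in, not a real reply.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Base URL as quoted in this thread; treat it as an assumption.
EUTILS = "http://www.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="pubmed"):
    """URL for an ESearch that stores its result set on the history server."""
    return EUTILS + "/esearch.fcgi?" + urlencode(
        {"db": db, "term": term, "usehistory": "y"})


def parse_history(esearch_xml):
    """Pull WebEnv, QueryKey and Count out of an eSearchResult document."""
    root = ET.fromstring(esearch_xml)
    return (root.findtext("WebEnv"),
            root.findtext("QueryKey"),
            int(root.findtext("Count")))


def efetch_url(webenv, query_key, retstart=0, retmax=500, db="pubmed"):
    """URL for an EFetch of one slice of the stored result set, as XML."""
    return EUTILS + "/efetch.fcgi?" + urlencode({
        "db": db, "WebEnv": webenv, "query_key": query_key,
        "retstart": retstart, "retmax": retmax, "retmode": "xml"})


# Hand-written stand-in for an ESearch reply (the WebEnv value is hypothetical).
sample = ("<eSearchResult><Count>664818</Count><QueryKey>1</QueryKey>"
          "<WebEnv>EXAMPLE_WEBENV</WebEnv></eSearchResult>")
webenv, query_key, count = parse_history(sample)
first_slice = efetch_url(webenv, query_key, retstart=0, retmax=500)
```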


go crazy

EUtils is great and quite easy to access with Perl/Python once you understand the URL syntax.
You could go mad and automate it to grab weekly references, parse the XML, whip up a web page or convert the XML to RIS and store in MySQL using RefDB - all things I'd like to do if I had the time.


Python EUtils library

Andrew Dalke has written a Python wrapper for the EUtils interface, available here. It is also available as part of the Biopython distribution.

from EUtils.ThinClient import ThinClient

eutils = ThinClient()

results = eutils.esearch("mice", field="mesh")

print results.read()

<eSearchResult>
<Count>664818</Count>
<RetMax>20</RetMax>
<RetStart>0</RetStart>
<IdList>
<Id>14647387</Id>
<Id>14985530</Id>
<Id>14985529</Id>
<Id>14978479</Id>
...
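If you would rather get at the count and the IDs than print the raw XML, something like the following should work using only the standard library. `parse_esearch` is my own helper name, and the sample string is a shortened, hand-closed copy of the output above, not a complete reply.

```python
import xml.etree.ElementTree as ET


def parse_esearch(xml_text):
    """Return (total hit count, list of PubMed IDs) from an eSearchResult."""
    root = ET.fromstring(xml_text)
    count = int(root.findtext("Count"))
    ids = [el.text for el in root.findall("./IdList/Id")]
    return count, ids


# Shortened copy of the reply shown above, with closing tags added by hand.
sample = """<eSearchResult>
<Count>664818</Count>
<RetMax>20</RetMax>
<RetStart>0</RetStart>
<IdList>
<Id>14647387</Id>
<Id>14985530</Id>
<Id>14985529</Id>
<Id>14978479</Id>
</IdList>
</eSearchResult>"""

count, ids = parse_esearch(sample)
# count is 664818; ids is ['14647387', '14985530', '14985529', '14978479']
```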


RetMax and RetStart

I believe RetMax (the maximum number of results to return) and RetStart (the index of the first result to return) default to 20 and 0, respectively, so you will only get a list of 20 IDs at a time. You'll need to set RetMax to however many you want. However, I've had problems getting more than 500 returned (that figure is off the top of my head, so I could be wrong). Your best bet would be to fetch, say, 500 at a time and step your way up: set RetMax=500, RetStart=0 on your first run, RetMax=500, RetStart=500 on your second, and so on.
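The stepping described here can be sketched in a few lines of Python. `page_offsets` is a hypothetical helper, and 500 is just the limit mentioned in this comment; it only computes the RetStart/RetMax pairs, so plug each pair into whatever request code you use.

```python
def page_offsets(total, page_size=500):
    """Yield (retstart, retmax) pairs that together cover `total` records."""
    for retstart in range(0, total, page_size):
        # The last page may be shorter than page_size.
        yield retstart, min(page_size, total - retstart)


pages = list(page_offsets(1200))
# pages is [(0, 500), (500, 500), (1000, 200)]
```

For the 664818 hits in the original question this works out to 1330 requests, the last one asking for 318 records.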


Maximum 500

I have also run into the 500 maximum. Is there an (easy) way around this?

Thanks


NCBI helpdesk

Try emailing the NCBI help desk (the link is on the homepage) about this limitation. It would be nice to hear what the NCBI guys have to say about massive PubMed retrievals and the best way to go about them.

Could you let us know what the response was?


500 limit

I wrote them, here is what they said:

At present there is a 500 limit but we do hope to increase that at some time
soon. Note that you can set up a loop with retstart and retmax.

Can someone give an example of how to set up the loop? Say I have a file with 100,000 accessions and I want to get the FASTA files for them.


500 Limit loop

You should use this; it is in fact NCBI's own example, written in Perl (I have done the same in PHP if you are interested).

sub ask_user {
print "$_[0] [$_[1]]: ";
my $rc = <STDIN>;  # read the user's answer from standard input
chomp $rc;
if($rc eq "") { $rc = $_[1]; }
return $rc;
}

# ---------------------------------------------------------------------------
# Define library for the 'get' function used in the next section.
# $utils contains route for the utilities.
# $db, $query, and $report may be supplied by the user when prompted;
# if not answered, default values will be assigned as shown below.

use LWP::Simple;

my $utils = "http://www.ncbi.nlm.nih.gov/entrez/eutils";

my $db = ask_user("Database", "Pubmed");
my $query = ask_user("Query", "zanzibar");
my $report = ask_user("Report", "abstract");

# ---------------------------------------------------------------------------
# $esearch cont


Download the fasta databases

EUtils is probably not the right approach if you just want the FASTA files. I would recommend downloading the appropriate database in FASTA format. However, if you don't have access to that much disk space or bandwidth, then EUtils is probably your only option.
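If you do go the EUtils route with a long list of accessions, a rough Python sketch of the batching might look like this. Several things here are my assumptions: the `nucleotide` database, the batch size of 500 (borrowed from the limit discussed above), the helper names, and the base URL, which is the one quoted earlier in this thread. The bodies are meant to be sent as POST data because a GET URL carrying hundreds of IDs gets very long.

```python
from urllib.parse import urlencode

# Base URL as quoted earlier in this thread; treat it as an assumption.
EFETCH = "http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def chunked(items, size=500):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def fasta_requests(accessions, db="nucleotide", size=500):
    """Yield (url, post_body) pairs, one per batch of accessions."""
    for batch in chunked(accessions, size):
        body = urlencode({
            "db": db,               # assumption: use "protein" for protein accessions
            "id": ",".join(batch),
            "rettype": "fasta",
            "retmode": "text",
        })
        yield EFETCH, body.encode("ascii")


# Made-up accession strings, purely for illustration.
accessions = ["ACC%05d" % i for i in range(1200)]
requests = list(fasta_requests(accessions))  # three batches: 500, 500 and 200 IDs
```

Each pair can be handed to `urllib.request.urlopen(url, data=body)`, with a pause between calls to stay within NCBI's usage rules.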

The question then becomes which of the available bioinformatics libraries you want to use for this: Bioperl or Biopython?

Post your problem as a new blog entry; you are bound to get a number of useful responses from fans of Bioperl and Biopython :)


Use the ELink interface.

Have you tried the ELink utility for the Entrez server?
http://www.ncbi.nlm.nih.gov/entrez/utils/utils_index.html

Here is the link to the example script. I think it will address your problem.

http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_example.pl