Resources for text mining?

Hello.
Recently I began studying the basics of text mining (related to literature mining) for a small project in our laboratory. I would like to know if there are any good introductory resources (either online or in book form) that give a good overview of the subject from a biological perspective (I'm a biotechnologist and still new to the computational field). Any help would be appreciated.
Thanks a lot.


Text Data Mining: Books and Open Source Software

The best book to start with for a big picture is Witten and Frank's "Data Mining, 2nd Edition". The best book for the gory details of the math is Hastie et al.'s "The Elements of Statistical Learning". Unfortunately, neither of these is specifically about text data mining. For text, the best reference is still Manning and Schuetze's "Foundations of Statistical Natural Language Processing", but it doesn't give you nearly the same kind of big-picture view of data mining, nor does it cover as many classification and clustering techniques.

You can also look at some of the open source software packages. For instance, we offer LingPipe, which you can download with Java source and documentation. There are tutorials for downloading MEDLINE, parsing its XML format, extracting named entities (e.g. proteins and cell lines), and putting the results in a MySQL database. There are also tutorials on indexing MEDLINE for search, and on part-of-speech tagging and sentence extraction for biology texts, among others.
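To give a rough idea of what the parse-and-store step involves, here is a minimal sketch in Python using only the standard library. It is not LingPipe (which is Java) and not taken from its tutorials. The XML snippet is an invented sample that imitates the MEDLINE citation layout (`MedlineCitation`, `PMID`, `ArticleTitle`, `AbstractText`), the function names are hypothetical, and SQLite stands in for MySQL so that the example runs without a database server.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Invented sample that imitates the MEDLINE citation layout (not real records).
SAMPLE_XML = """<MedlineCitationSet>
  <MedlineCitation>
    <PMID>1</PMID>
    <Article>
      <ArticleTitle>A hypothetical study of p53 in HeLa cells.</ArticleTitle>
      <Abstract>
        <AbstractText>Invented abstract text for illustration.</AbstractText>
      </Abstract>
    </Article>
  </MedlineCitation>
  <MedlineCitation>
    <PMID>2</PMID>
    <Article>
      <ArticleTitle>Another invented title.</ArticleTitle>
    </Article>
  </MedlineCitation>
</MedlineCitationSet>"""


def parse_citations(xml_text):
    """Return a list of (pmid, title, abstract) tuples from MEDLINE-style XML."""
    root = ET.fromstring(xml_text)
    records = []
    for citation in root.iter("MedlineCitation"):
        pmid = citation.findtext("PMID", default="")
        title = citation.findtext("Article/ArticleTitle", default="")
        # Some citations have no abstract; an empty string is stored in that case.
        abstract = citation.findtext("Article/Abstract/AbstractText", default="")
        records.append((pmid, title, abstract))
    return records


def store_citations(records, connection):
    """Write the parsed tuples into a simple three-column table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS citation "
        "(pmid TEXT PRIMARY KEY, title TEXT, abstract TEXT)"
    )
    connection.executemany(
        "INSERT OR REPLACE INTO citation VALUES (?, ?, ?)", records
    )
    connection.commit()


records = parse_citations(SAMPLE_XML)
conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
store_citations(records, conn)
print(conn.execute("SELECT COUNT(*) FROM citation").fetchone()[0])
```

A named-entity step would sit between parsing and storing, for example by adding a column for the entities found in each abstract.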

There's a similar package released on SourceForge called OpenNLP, but it doesn't contain biology-specific modules as far as I know. It's a little more researchy and less industrial than ours. And then there's Steven Bird et al.'s NLTK package, previously mentioned, which is aimed at learning and is written in Python. Then there are some more sophisticated statistical packages, such as Andrew McCallum's Mallet (UMass) and William Cohen's MinorThird (CMU), both in Java. Cohen, in particular, has done a lot with MinorThird in biomedical text data mining.

Bob Carpenter

Alias-i, Inc.


Firstly, thanks for the informative comment. I'll be saving those books to my Amazon wish list.

Secondly, in case anyone from industry/business is reading this, Bob's comment above is an excellent way to get people in a community forum like nodalpoint to notice your products without being annoying about it.


Thanks for both comments. The book looks interesting; I'll give it a go.
As for Python, by chance it's exactly the programming language I'm learning, so that library will come in useful.


Text mining for Biology and Medicine

You might find Text mining for Biology and Medicine a useful starting point if you are new to this field.


It all depends on what language you want to use. If you're willing to give Python a try, you might want to investigate the Natural Language Toolkit (NLTK). It was designed as a base library for teaching NLP in undergraduate computer science classes. It is a general library covering most of the popular methods of text mining, and is not specific to biology. The documentation is also good, with a lot of background material on NLP in the tutorial.
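
For a feel of what a first text mining exercise looks like, here is a minimal sketch in plain Python that tokenizes a sentence and counts term frequencies. It uses only the standard library, so it is not NLTK's API; the sample text, the tiny stopword set, and the function names are all made up for illustration. A toolkit like NLTK provides much more careful versions of these steps.

```python
import re
from collections import Counter

# Made-up sample text, for illustration only.
TEXT = "The p53 protein binds DNA. Mutant p53 protein fails to bind DNA."


def tokenize(text):
    """Lowercase the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_frequencies(text, stopwords=frozenset({"the", "to"})):
    """Count tokens, skipping a (deliberately tiny) set of stopwords."""
    return Counter(tok for tok in tokenize(text) if tok not in stopwords)


freqs = term_frequencies(TEXT)
print(freqs.most_common(3))
```

Note that this naive tokenizer splits a name like "IL-2" into two tokens, which is one reason biology texts usually need specialised tokenization.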