MEDIE is an “intelligent” semantic search engine that retrieves biomedical correlations from over 14 million articles in MEDLINE. You can find abstracts and sentences in MEDLINE by specifying the semantics of correlations; for example, What activates tumour suppressor protein p53? So just how useful is MEDIE and is it at the cutting edge?
At the Manchester Interdisciplinary Biocentre (MIB) launch yesterday, Professor Jun'ichi Tsujii gave a presentation on Linking text with knowledge - challenges for Text Mining in Biology. As part of this presentation he gave a demonstration of Medie: an intelligent search engine for Medline. This tool looks quite impressive if you experiment with some sample queries. I wonder what nodalpointers, especially hardened text-miners, natural language processing (NLP) nerds and computational linguists, make of Medie?


Comments
not overly impressed
I didn't think much of Medie based on the sample queries, or the few queries that I tried. But then, I have a low opinion of text mining in general. I get the theory - sentences have structure (e.g. "A is something-ed by B", "X somethings Y") and there are a finite number of variations in this structure, so we should be able to derive rules from it. In my experience though, that's not what happens. I don't know whether it's because we need more rules, or more complex rules, but my suspicion is that in a medium without constraints where people can write in any style that they choose (like an abstract), they will find a way of expressing themselves that confuses the rule machine. Let's face it, the standard of written English in many abstracts is not high. There's a lot of subtlety in expression too: "leading to enhanced phosphorylation" is not the same as "phosphorylates".
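To make the point about rules concrete, here is a minimal sketch of the "X somethings Y" / "A is something-ed by B" idea. It is not how MEDIE works (MEDIE uses a full parser); the two regular expressions, the verb lists, the `extract` function and the placeholder names KinaseA, ProteinB and ProteinC are all made up for illustration.

```python
import re

# Hypothetical verb lists: surface form -> normalised relation name.
ACTIVE_VERBS = {"activates": "activate", "phosphorylates": "phosphorylate"}
PASSIVE_VERBS = {"activated": "activate", "phosphorylated": "phosphorylate"}

# Rule 1: "X somethings Y"
ACTIVE = re.compile(r"\b(\w+) (activates|phosphorylates) (\w+)\b")
# Rule 2: "A is something-ed by B"
PASSIVE = re.compile(r"\b(\w+) is (activated|phosphorylated) by (\w+)\b")


def extract(sentence):
    """Return (agent, relation, target) triples matched by the two rules."""
    triples = []
    for m in ACTIVE.finditer(sentence):
        triples.append((m.group(1), ACTIVE_VERBS[m.group(2)], m.group(3)))
    for m in PASSIVE.finditer(sentence):
        # In the passive form the agent comes last, so swap the order.
        triples.append((m.group(3), PASSIVE_VERBS[m.group(2)], m.group(1)))
    return triples


if __name__ == "__main__":
    for s in [
        "KinaseA phosphorylates ProteinB",
        "ProteinB is phosphorylated by KinaseA",
        "KinaseA binds ProteinC, leading to enhanced phosphorylation of ProteinB",
    ]:
        print(s, "->", extract(s))
```

The first two sentences yield the same triple. The third yields nothing, and it is not obvious that any rule should fire on it, since the sentence does not claim that KinaseA phosphorylates ProteinB directly.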
I think the only sure way to describe biological interactions in a way that allows accurate mining is controlled ontologies. You could spend years refining text mining rules - or you could set up a standardised descriptive system at the outset and use that. It's the same old story - you wonder if some people really want to address the problems of biologists or whether they're just cashing in because text mining is fundable at the moment.
From MEDIE team
I would like to comment, since I was the speaker on MEDIE at MIB.
1) Perhaps our MEDIE website and my presentation gave the wrong impression. MEDIE is intended to show the general functionality that parsing (NLP) technology can provide for intelligent text mining, information retrieval, etc. As it stands, it is not intended to be a system for performing a specific task such as extracting protein-protein interactions from text.
2) We are fully aware that another task-specific layer of software (or a set of rules) is needed. You are right in saying that we need a huge set of rules in this layer. In short, to reduce the number of rules or to introduce different technologies in this layer such as statistical models (instead of “rules
text mining vs. ontologies
I see what you mean, there are obvious limitations to text mining. But then there are obvious limitations to controlled vocabularies and ontologies as well. Whereas text mining can at least try to make sense of millions of abstracts, who or what is going to accurately annotate 14 million MEDLINE abstracts with ontological terms? I think GoPubMed is a nice demonstration that combines the text mining and ontological approaches, although it's far from perfect.
who or what is going to
who or what is going to accurately annotate 14 million medline abstracts with ontological terms?
Nobody. I'm suggesting that abstracts in their present form are not a worthwhile data source for mining biological data. Sure, text mining can "try to make sense", but when you then spend all your time figuring out if it made a good job of it or not, what have you really gained? Better to build a working system from scratch, rather than make do with a kludge that doesn't really save you time or effort.
Who's annotating millions of abstracts?
The obvious smart-aleck answer is the National Library of Medicine. They're putting MeSH (Medical Subject Headings) terms on every MEDLINE citation, of which there are now more than 15M. To quote from NLM's paper The MeSH Translation Maintenance System: Structure, Interface Design, and Implementation:
NLM must have an army of folks keeping up with the roughly 2000 articles/day added to MEDLINE. And I have no idea how accurately this is done in terms of inter-annotator agreement.
An obvious alternative would be to allow user tags in the Web 2.0 sense. Users reading articles could add tags. Other users could then search by tag. It helps with recall and precision for Flickr and YouTube. It could work for research articles. As soon as this became even moderately popular, people would begin tagging their own entries so that others could find them. Each paper has at least one individual (the author) with a vested interest in making the paper easy to find.
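As a rough sketch of what tag-and-search could look like, here is a tiny in-memory tag index. The `TagIndex` class, its method names and the article identifiers are hypothetical; a real service would need persistence, user accounts and some defence against tag spam.

```python
from collections import defaultdict


class TagIndex:
    """Maps each normalised tag to the set of article IDs carrying it."""

    def __init__(self):
        self._by_tag = defaultdict(set)

    @staticmethod
    def _normalise(tag):
        # Treat "P53 " and "p53" as the same tag.
        return tag.strip().lower()

    def add_tag(self, article_id, tag):
        """Record that some user tagged article_id with tag."""
        self._by_tag[self._normalise(tag)].add(article_id)

    def search(self, *tags):
        """Return the IDs of articles carrying every one of the given tags."""
        if not tags:
            return set()
        sets = [self._by_tag.get(self._normalise(t), set()) for t in tags]
        return set.intersection(*sets)


if __name__ == "__main__":
    index = TagIndex()
    index.add_tag("article-1", "p53")
    index.add_tag("article-1", "Apoptosis")
    index.add_tag("article-2", "p53")
    print(index.search("p53"))
    print(index.search("p53", "apoptosis"))
```

Requiring all tags to match favours precision; returning the union of the per-tag sets instead would favour recall.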
(Disclaimer: We have an NLM SBIR grant focused on biomedical text processing. We're working with real biomedical researchers because it's a real problem when you have 500 candidate genes listed by Entrez ID and you want to investigate the literature surrounding them.)
Bob Carpenter
Alias-i, Inc.