Dear Santa, all I want for Christmas* is a better version of Connotea. Please can you sort out its duplicated, redundant links? In my book this particular bug is “buggotea” number one. Here is the problem... [update: buggotea is partially fixed, see comments from Ian Mulvany below]
There is this handy bioinformatics web application called Connotea which I like to use, built by those nice people in the web team at Nature Publishing Group. Most readers of nodalpoint probably already know about it, but because you're Santa and you've been busy lately, let me explain. Connotea can help scientists (not just bioinformaticians) to organise and share their bibliographic references, whilst discovering what other people with similar interests are reading. It's good, but it has some bugs in it. Since it's open-source software, anyone with the time, inclination and skills can get hold of the Connotea source code and improve it. There is, however, one particularly nasty redundancy bug in Connotea that is bugging me [1]. I think it should be fixable, and that doing so would make Connotea a significantly better application than it already is. Let's illustrate this bug with a little story...
I have five bioinformatics colleagues with Connotea usernames glycine, methionine, threonine, tyrosine and valine. They are all web-savvy researchers who use Connotea to manage and share their references. Like many bioinformaticians, they are also desperate Perl hackers, and one of their favourite papers is The Bioperl Toolkit: Perl Modules for the Life Sciences. This highly-cited paper by Jason Stajich et al., published in Genome Research, describes the libraries available in Bioperl.
My first colleague, glycine, found Jason's Bioperl paper by browsing PubMed. Using Connotea they bookmarked a PubMed link, a particular type of Uniform Resource Identifier (URI), shown below:
So far so good. My next colleague, tyrosine, also bookmarked a PubMed link, but a subtly different URI for the same paper. This is because the "dopt" (display format) parameter in the URI has a different value, "Abstract" instead of "AbstractPlus" like this:
It's a small difference, but, as we shall see, it has important consequences. Another colleague, threonine, found the paper on the genome.org website and bookmarked the URI of the paper's full content:
...While valine just bookmarked the URI of the abstract at
Meanwhile methionine, who is a big fan of Digital Object Identifiers (DOI), bookmarked the paper's DOI (doi:10.1101/gr.361602), magically transforming it into a URI by prefixing it with http://dx.doi.org like this:
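The prefixing trick is simple enough to sketch in a few lines of Python. This is only an illustration: the function name `doi_to_uri` is made up for this post and is not part of Connotea.

```python
def doi_to_uri(doi: str) -> str:
    """Turn a DOI such as 'doi:10.1101/gr.361602' into a resolvable URI.

    Hypothetical helper for illustration only.
    """
    bare = doi.strip()
    # Drop an optional 'doi:' label so that only the identifier remains.
    if bare.lower().startswith("doi:"):
        bare = bare[len("doi:"):]
    # Prefix the bare identifier with the DOI resolver mentioned in the post.
    return "http://dx.doi.org/" + bare


print(doi_to_uri("doi:10.1101/gr.361602"))
# prints http://dx.doi.org/10.1101/gr.361602
```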
Finally, duncan (that's me) finds the paper from a PubMed search, and bookmarks the URI below from the search results like so:
It won't take a brain surgeon to realise from the above story that six different people bookmarked six different and redundant URIs for the same paper. With many URIs representing any given paper, this is very common. Now, one of the useful features of Connotea is that while bookmarking all these different URIs it uses CrossRef to retrieve any relevant metadata. The author, journal, publication date and any unique identifier(s) are automagically retrieved to save us the hassle of typing them in. So, despite the fact that we've all bookmarked different URIs for the same paper, Connotea shows that three users have actually bookmarked the paper identified by the PubMed identifier PMID:12368254 and five users have bookmarked the object identified by doi:10.1101/gr.361602, see [2] for examples.
The trouble is, Connotea doesn't currently use this metadata intelligently to reason that these URIs all represent the same paper. Because they are different URIs, it naïvely treats them as if they are completely different papers, even though they share DOI and PubMed identifiers. Of course, this redundancy isn't really the fault of Connotea, it's an inherent part of the web, but the result is unfortunate and avoidable fragmentation. Instead of Connotea showing "Posted by glycine and 5 others to bioperl" underneath each bookmark (which is what should ideally happen), it displays "Posted by glycine (and 0 others) to buggotea".
With different URIs bookmarked, we can't see accurately how many people have bookmarked a given paper, as most papers incorrectly appear to have been bookmarked only once or perhaps twice. Neither can we see who has bookmarked any given paper, unless they happened to use exactly the same URI, which is pretty unlikely. Now you can obviously look this kind of popularity data up in various citation databases, but each of these has its own unique flaws, and social bookmarking is supposed to be what Connotea is all about. The result of all this is that the shared tagging and web 2.0 goodness of Connotea is mostly lost, mysteriously disappearing just like you do when you've finished delivering your presents on Christmas Eve.
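To make the difference concrete, here is a rough Python sketch of the two ways of counting "others". It is not how Connotea works internally. The URIs are placeholders, and the assignment of identifiers to each user is an assumption chosen only to match the totals in the story (three users sharing the PMID, five sharing the DOI).

```python
# Hypothetical data: (user, bookmarked URI, identifiers retrieved as metadata).
# The URIs are placeholders; which user got which identifier is assumed.
PMID = "pmid:12368254"
DOI = "doi:10.1101/gr.361602"
bookmarks = [
    ("glycine", "uri-1", {PMID, DOI}),
    ("tyrosine", "uri-2", {PMID, DOI}),
    ("threonine", "uri-3", {DOI}),
    ("valine", "uri-4", {DOI}),
    ("methionine", "uri-5", {DOI}),
    ("duncan", "uri-6", {PMID}),
]


def others_by_uri(bookmarks, user):
    """Other users whose bookmark has exactly the same URI (the buggy count)."""
    uri = next(u for name, u, _ in bookmarks if name == user)
    return sorted(name for name, u, _ in bookmarks if u == uri and name != user)


def others_by_identifier(bookmarks, user):
    """Other users whose bookmark shares at least one identifier."""
    ids = next(i for name, _, i in bookmarks if name == user)
    return sorted(name for name, _, i in bookmarks if name != user and ids & i)


print(len(others_by_uri(bookmarks, "glycine")))         # 0 others
print(len(others_by_identifier(bookmarks, "glycine")))  # 5 others
```

Note that this only looks at direct overlap: in this made-up data duncan, who has just the PMID, would see two others rather than five, so a complete fix would also need to chain identifiers together.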
So Santa, if you're reading this blog, and you know any talented perl-hacking elves with database expertise, could you please ask them to sort this ugly redundancy bug [3]? I hope they won't be too busy helping you wrap presents and look forward to seeing a better version of Connotea sometime soon. Have a very Webby Christmas!
[Posted by santaclaus and 10 others (mostly elves) to xmas wishes]
References
[1] Possible Problems with Connotea: Redundant URIs
[2] Buggotea: The Redundancy Bug in Connotea
[3] You've read about redundancy, now buy the T-shirts: Department of Redundancy Department and I love ❤ redundancy
[4] *I should really have said, “All I want for Christmas apart from world peace and an end to poverty and that book I mentioned and a Dukla Prague Away Kit [5] and ... etc ”
[5] Half Man Half Biscuit (1986) ♫ All I want for Christmas is a Dukla Prague Away Kit ♫
[6] Citeulike feature requests: Shared bibliographic info for a paper across all of citeulike
[7] Connotea or Citeulike?

This work is licensed under a
Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.


Comments
I use this solution:
This problem bugged me badly, because my library was getting littered with numerous duplicate copies of the same PubMed abstract. Because PubMed URLs often contain session-specific info like query_hl=10, the same abstract from 2 different sessions would get added twice to my library.
So I googled for "javascript tutorial" and then hacked the bookmarklet javascript to strip off the PMID from the URL and construct a new minimal URL in a consistent format to send to connotea. Another advantage of this bookmarklet is that in cases where a pubmed search returns only one abstract, the PMID is not in the URL at all (for instance, this search: search pubmed for term connotea). In this case, I can highlight the PMID on the page and then clicking the bookmarklet constructs the URL with the highlighted PMID and sends it to Connotea.
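The idea can be sketched roughly as follows. This is Python rather than the bookmarklet's JavaScript, and it is not the actual bookmarklet code. The `list_uids` parameter name and the base URL are assumptions about what PubMed URLs looked like at the time, and `canonical_pubmed_url` is a made-up name.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Assumed base URL for PubMed abstract pages (illustration only).
PUBMED_BASE = "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi"


def canonical_pubmed_url(url, highlighted_text=None):
    """Build one minimal, consistent URL for a PubMed abstract.

    The PMID is taken from the (assumed) 'list_uids' query parameter, or
    from text the user highlighted when the URL carries no PMID.
    Returns None if no numeric PMID can be found.
    """
    query = parse_qs(urlparse(url).query)
    candidates = query.get("list_uids", []) + [highlighted_text or ""]
    pmid = next((c.strip() for c in candidates if c.strip().isdigit()), None)
    if pmid is None:
        return None
    # Keep only the parameters needed to identify the abstract, dropping
    # session-specific ones such as query_hl.
    params = {"cmd": "Retrieve", "db": "pubmed", "dopt": "Abstract", "list_uids": pmid}
    return PUBMED_BASE + "?" + urlencode(params)


session_a = PUBMED_BASE + "?cmd=Retrieve&db=pubmed&list_uids=12368254&query_hl=10&dopt=AbstractPlus"
session_b = PUBMED_BASE + "?cmd=Retrieve&db=pubmed&list_uids=12368254&query_hl=3&dopt=Abstract"
print(canonical_pubmed_url(session_a) == canonical_pubmed_url(session_b))  # True
```

Two bookmarks of the same abstract made in different sessions then end up with the same URL, so the library holds only one copy.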
The bookmarklet javascript is here:
http://brahms.cpmc.columbia.edu/~suresh/connotea-bookmarklet.html
Obviously, everyone would have to use this solution to make their references consistent with the URL format I happened to use, but at least I don't end up enthusiastically adding 10 copies of the same *very important* abstract to my library myself.
HTH, S.
This is why I use citeUlike
It's good only for papers, but gets rid of the redundancy, and the abstract is in there.
Merry Christmas
Citeulike, Buggotea and DOI's
Yes, citeulike doesn't suffer from buggotea, although it doesn't let you post DOIs as URIs, e.g. in the form http://dx.doi.org/10.1101/gr.361602. It seems that citeulike generally spurns DOIs altogether?
HMHB
This is for sure the first reference to Half Man Half Biscuit on Nodal. I rate it highly purely for that reason :)
Us NPG pixies hear ya
This is something that we've been meaning to address on connotea.org for a while, but other more pressing concerns keep cropping up. The basic problem is that Connotea is keyed on URLs rather than identifiers or a combination of the two.
We could normalize the URLs, but I'm not sure that this is a good idea: we'd be making assumptions about what the user intended to bookmark in the first place (the paper? The abstract as it appears on PubMed? The free full text version on PubMedCentral? The paper as accessed through a proxy of some kind?).
What could / should be done is a fix to make the "bookmarked by x other people" snippet and URL info pages take shared identifiers into account. We'll definitely look at this in the new year, unless anybody fancies having a go with the open source release, in which case you rock and should feel free to email if you need any help.
Trumpton Riots
Thanks to NPG for your rapid response. As for Half Man Half Biscuit, I suppose obscure 1980s English punk bands are a little "off topic" for a bioinformatics blog but I couldn't resist the reference!
Buggotea, first pass fix
Last Friday we released a major update to connotea with the aim of resolving Buggotea. Prior to this update the main database tables were something like this:
user
bookmark
user_bookmark
user_bookmark_comment
user_bookmark_details
user_bookmark_tag
and the unique key in the bookmark table was the hash of the URL. This is where the entire problem with buggotea stems from. All of the functionality of Connotea was built on top of this structure. You could say that philosophically the structure was set up for people to share web pages that represented articles rather than articles represented by web pages.
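A small Python sketch shows why keying on a URL hash fragments the data. The choice of MD5 and the example URLs are assumptions made for illustration. The post does not say which hash function Connotea uses.

```python
import hashlib


def bookmark_key(url: str) -> str:
    """Hypothetical stand-in for the old unique key: a hash of the URL."""
    # MD5 is an assumption for this sketch; any hash shows the same effect.
    return hashlib.md5(url.encode("utf-8")).hexdigest()


# Two illustrative URLs for the same abstract that differ only in 'dopt'.
base = "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&list_uids=12368254"
key_abstract = bookmark_key(base + "&dopt=Abstract")
key_abstract_plus = bookmark_key(base + "&dopt=AbstractPlus")

# Any difference in the URL yields a different key, hence a separate row.
print(key_abstract != key_abstract_plus)  # True
```

Because the key depends on every character of the URL, two bookmarks of the same paper become unrelated rows unless the URLs match exactly.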
We have changed the tables to the following structure:
user
article
bookmark
user_article (with preferred bookmark and citation pointers)
user_article_comment
user_article_details
user_article_tag
Where the article is an abstract entity more representative of the fact that people want to reference an abstract piece of work that can exist in many different locations. We now normalise bookmark entries via PMID and DOI into a single article object. It seems to be working so far for entries in Connotea that have PMID and/or DOI information.
This is a first step and for sure there are things that can be improved: support for normalisation based on other metadata, how the information about an article with multiple instances is displayed, and concatenation of the best metadata from multiple authoritative sources. It's not rocket science but this is a good step towards making Connotea more useful in the academic realm. Anyway, I thought you guys might be interested. Let me know if you spot any odd behavior with the system.
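One way to picture this kind of normalisation is the Python sketch below. It is not the Connotea implementation, which is not shown here. It assumes each bookmark carries an optional PMID and an optional DOI, and it uses a union-find structure so that a PMID-only bookmark and a DOI-only bookmark are linked through any bookmark that has both. The function name and example URLs are hypothetical.

```python
def group_into_articles(bookmarks):
    """Group bookmarks that share a PMID or DOI into article sets.

    bookmarks: list of (url, pmid_or_None, doi_or_None) tuples.
    Returns a list of sets of URLs, one set per inferred article.
    """
    parent = list(range(len(bookmarks)))

    def find(i):
        # Follow parent links to the root, halving paths as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # (identifier kind, value) -> index of first bookmark
    for index, (_, pmid, doi) in enumerate(bookmarks):
        for key in (("pmid", pmid), ("doi", doi)):
            if key[1] is None:
                continue
            if key in first_seen:
                # Same identifier seen before: merge the two groups.
                parent[find(index)] = find(first_seen[key])
            else:
                first_seen[key] = index

    articles = {}
    for index, (url, _, _) in enumerate(bookmarks):
        articles.setdefault(find(index), set()).add(url)
    return list(articles.values())


example = [
    ("http://example.org/pubmed-page", "12368254", None),
    ("http://example.org/publisher-page", "12368254", "10.1101/gr.361602"),
    ("http://dx.doi.org/10.1101/gr.361602", None, "10.1101/gr.361602"),
    ("http://example.org/unrelated", None, None),
]
print(len(group_into_articles(example)))  # 2 articles
```

Bookmarks with neither identifier stay in their own group, which matches the remark that the fix works for entries that have PMID and/or DOI information.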
Thanks Ian, buggotea is almost fixed
Thank you Ian for doing this. Connotea seems to perform better now. E.g. the BioPERL paper referred to in the original buggotea post
http://www.connotea.org/article/b0a2feab5e48494d3b03c39add264f22
has been bookmarked by "user tyrosine and 6 others" while
http://www.connotea.org/article/eff51b48308ba0bb333e598234c742aa
has been bookmarked by "user valine and 3 others". This is much better, because before I think they were all "user x and 0 others".
I think normalising URIs is a hard problem...