Hi all,
I've whipped up some code for a web project I'd like to get started - it's kind of like a life science specific Technorati. It collects blog posts via RSS from life science blogs and then does useful and interesting things with that data (in theory). It's at www.postgenomic.com.
One of the aforementioned useful and interesting things that the site does is to act as a centralized repository for reviews of papers and for conference reports written by science bloggers. The site can do some of this just by text parsing, but obviously an approach involving some sort of semantic markup would be better. If anybody has thought - or would like to think - about, for example, what kind of structured information a conference report should convey then please pitch in.
The code and database are open source (I'm not sure what the exact legal status of a database full of other people's blog posts is, actually), so if you'd like to munge the data in some new and interesting way or to improve the site (shouldn't be hard) get in touch and I'll send you the relevant files (Perl scripts, MySQL dumps and PHP for the web interface). More importantly, though, I'd like to get some ideas about how best to mark up reviews and reports.
I've put a few more details on F&L, here, but check out the site to get a picture of how it works.
Usual caveats about freshly minted webapps apply - the server is slow, the database isn't very big yet and the scripts will probably cack out if you do anything fancy.
Cheers
Stew


Firstly, great work, someone
Firstly, great work, someone needed to 'throw down the gauntlet' in this space. Second, you've now opened the proverbial 'can-o-worms'. Hosting, bandwidth, metadata standards, RSS shenanigans, VC funding, the eventual IPO, rock star science status etc.
As far as hosting goes, you'll want a deal that can offer reliability of service. People will lose interest quite quickly if they can't access the service. If you put your hand up and ask for commercial support you are likely to get it. What kind of compromises you'll need to make is anyone's guess.
My current hosting could probably handle it for a while, but if it got popular then financing more bandwidth either through advertising or fund raising would be necessary. Hosting at work is always an option if no one notices, but if they do, someone has to pay. You could get it sponsored via an internal grant, given that it is open source and beneficial to the scientific community, but that would be a hard sell.
More structured metadata in posts would be nice for this service. I like the idea of simple semantic markup (rel="review", meta tags etc.), but something more along the lines of the Structured Blogging initiative would be better (tool support needed of course). I'll see what I can do to modify Drupal to allow user-generated metadata on posts. I suspect I'll need to build a new node type 'review' for this purpose.
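To make the simple-markup idea a bit more concrete, here's a rough sketch of how an indexer might pick out links tagged rel="review" from a post's HTML. It's in Python purely for illustration (the site's own scripts are Perl and PHP), the class and function names are made up, the sample post and example.org URLs are invented, and rel="review" is just the convention floated above, not an agreed standard.

```python
from html.parser import HTMLParser


class ReviewLinkParser(HTMLParser):
    """Collect the href of every <a> tag whose rel attribute includes 'review'."""

    def __init__(self):
        super().__init__()
        self.review_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated tokens, e.g. rel="review nofollow"
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        href = attr_map.get("href")
        if "review" in rel_tokens and href:
            self.review_links.append(href)


def find_review_links(post_html):
    """Return the targets of all rel="review" links in a blog post (hypothetical helper)."""
    parser = ReviewLinkParser()
    parser.feed(post_html)
    parser.close()
    return parser.review_links


# Invented sample post: one tagged review link, one ordinary link.
sample_post = (
    '<p>Some thoughts on <a rel="review" href="http://example.org/paper/123">'
    'this paper</a> and an <a href="http://example.org/other">unrelated link</a>.</p>'
)
print(find_review_links(sample_post))
```

Something like this only tells you that a post reviews a given URL; anything richer (rating, conference name, session) would need the kind of structured fields mentioned above.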
Data quality is an issue of course. There are two approaches here: be a gatekeeper of quality (i.e. personally vet sites before adding them to the index) or let it be a free-for-all. It will be interesting to see what works best for the life sciences domain.
Lastly: release often, release early :)
Structured Blogging in Drupal
There have been mentions of Structured Blogging in Drupal. I presume it's being worked on.