Monday, November 17, 2008

Why do we blog?

Martin Fenner asked science bloggers on Nature Network some questions that I think are interesting. Plus, the meme is going around my blogging neighbourhood, so I thought I would join in as well:

1. What is your blog about?
It is mostly about science and technology with a particular focus on evolution, bioinformatics and the use of the web in science.

2. What will you never write about?
I will never blog about blog memes like this one. I tend to stay away from religion and politics but never is a very strong word.

3. Have you ever considered leaving science?
Does this mean academic research, research in general or science in general? In any case, no. I love problem solving and the freedom of academic research. The only thing I dislike about it is not being sure that I can keep doing this for as long as I wish.

4. What would you do instead?
If I could not do research I would probably try to work in scientific publishing. Doing research usually means that we have to focus on a very narrow field. Editors on the other hand are almost forced to broaden their scope and I think I would like this. I would also be interested in the use of new technologies in publishing.

5. What do you think science blogging will be like in 5 years?
Five years is a long time at the pace of technological development but not a long time for cultural change. I could be wrong but, if anything, there will be only a small increase in the adoption of blogging as part of personal and group online presence, alongside the already existing web pages. I wish blogging (and other tools) would be used to further decouple research agendas from physical location, but I don't think that will happen in 5 years.

6. What is the most extraordinary thing that happened to you because of blogging?
I have gained a lot from blogging. The most concrete example was an invitation to attend SciFoo but there are many other things that are harder to evaluate. In some ways it is related to the benefits of attending conferences. It is useful because you get to interact with other scientists and exchange ideas, and it forces you to think through different perspectives.

7. Did you write a blog post or comment you later regretted?
I probably did but I don't remember an example right now.

8. When did you first learn about science blogging?
Like many other bioinformatics bloggers, I started blogging at Nodalpoint, in November 2001 according to the archives. I started this blog some two years after that.

9. What do your colleagues at work say about your blogging?
Not much really; I don't think many of them are aware of it. If anything, the responses have been generally positive, but I don't usually find many people interested in knowing more about blogging in science.

Wednesday, November 12, 2008

Open Science - just do it

My blog is 5 years old today and to celebrate I am trying to actually do some blogging. There are a couple of reasons why I have blogged less in the past months. In part it was due to FriendFeed and also in part because I was trying to finish a project on the evolution of phospho-regulation in yeast species. Nearing the end of a project should actually provide some of the most interesting blogging material but I did not ask for permission from everyone involved to write about ongoing work.

I have to admit that although I have been discussing and evangelizing open science for over two years I have done very little of it. I have sometimes used this blog to put up small analyses or mini-reviews but never to describe ongoing projects. I have tried to start a side-project online but I over-estimated the amount of "spare cycles" I have for this. So, I have talked it over with my supervisor and I am now free to "risk" as much as I want in trying out Open Science. The first project I will be trying to work on will be on E3 target prediction and evolution.

Prediction and evolution of E3 ubiquitin ligase targets
As I have mentioned above, I have been working in the past months on the evolution of phosphorylation and kinase-substrate interactions in yeast species. I am interested in the evolution of regulatory interactions in general because I believe that they are important for the evolution of novel phenotypes. This is why I will be trying to study the evolution of E3 target interactions. In order to get there I will first try to develop some methods to predict ubiquitination and E3 targets. Since a lot of the ideas and methodology apply to other post-translational modifications and even localization signals, I will in the future try to generalize the findings to other types of interactions.

Some of the questions that I will try to address:
- How accurately can we predict E3 substrates?
- How quickly in evolution do E3 targets change?
- Is there co-regulation by kinases and E3s on the same targets (and how does this evolve)?

Once I have something substantial I will open a code repository on Google Code.

Tuesday, September 02, 2008

Books: long tails and crowds

I read two interesting books recently that relate to how the internet is changing businesses and society in general.


“The Long Tail” by Chris Anderson ends up suffering from its own success. I was so exposed to the long tail meme before reading the book that there were very few novel ideas left to read. The book describes the business opportunities that come from having near-unlimited shelf space. While physical stores are forced to focus on the big hits, long tail businesses sell those big hits but also all the other niche products that only a few people will be interested in. There is a big challenge in trying to guide users to those niche products that they will be interested in. Anderson provides examples of recommendation and reputation engines from several companies (e.g. Amazon, iTunes, eBay) that by now most of us are familiar with. Even for those well exposed to log normal distributions and long tail businesses the book is still worth getting as a resource and for the very interesting historical perspective on the origins of long tail businesses.

“Here Comes Everybody” is an excellent book by Clay Shirky that describes the huge decrease in the cost of group formation that we are currently living through. Through a series of stories Shirky demonstrates how the internet facilitates group formation and how collective actions that were impossible before are now becoming the norm. His stories range from ideas as simple as the photo collections in Flickr to the coordination of regime opposition in Byelorussia. I appreciate the somewhat neutral stance on the phenomena. The book covers cases where online groups almost change to a mob-like mentality and others where groups of consumers were able to stand up to corporations to guarantee their rights. The outcome of easy group formation for the future of society is not easy to predict and this is well conveyed in the book.

The subjects and stories from these books are also interesting for scientists because they can influence the way we work. Science is a long tail of knowledge with many niche areas that only a few people in the world care about. The recommendation and reputation engines described could help us navigate the body of knowledge to find those bits that interest us the most. Also, easy group formation might one day shift the way we work so that innovation and research are not determined by physical location but are instead focused on the research problems.

Tuesday, August 12, 2008

Freebase parallax

Freebase parallax is a new browsing interface for Freebase. It allows the user to drill in and connect sets of objects to other sets of objects within Freebase and draw maps and graphs with the information. This really shows the power of having well structured data available online. Here is a video describing how it works with great examples of data mining:

Sunday, August 10, 2008

Post-publication journals

With the increase in the number of journals and articles being published every year and the possibility of having an even larger set of "gray literature" available online we face the challenge of filtering out those bits of information that are relevant for us.

Let us define as "perceived impact" this subjective measure of importance that some bit of information holds for us as scientists. This information is typically an article but it could be applied later to pre-prints and database entries in general.

Every one of us creates some rules to select what to pay attention to from the constant stream of scientific output. We could picture this sorting process in the form of a triangle, with a large base of very specific knowledge that is somewhat important to us and a small amount of more general but highly important content at the top. For the majority of scientists today, these sorting rules are based on journal topic (cell biology, physics, evolution, etc) and journal impact factor. Below the base we could place the gray literature that today is mostly out of sight and is not peer-reviewed.

With the advent of the web, and in particular the social aspects of this new medium, we should expect better than the evaluation of articles based on the quality of the journal they were published in. In the words of Eugene Garfield, the inventor of the impact factor:

“In order to shortcut the work of looking up actual (real) citation counts for investigators the journal impact factor is used as a surrogate to estimate the count. I have always warned against this use”. Eugene Garfield (1998)

Scientific publishing is now digital, with every article having a universal digital identifier (DOI). However, as an author I can get (for free) much more information about how people are using the content from this blog than about how they use the articles I have published. Information about the number of downloads and about citations in other articles, in scientific blogs or in bookmarking services could help us sort through information in a better way than relying solely on journal editors (impact factors). We should be using the social web to re-sort articles after peer-review to reflect our preferences:

How would we build such a personalized sorting system? In the words of the editor-in-chief of Nature:

(…) nobody wants to have to wade through a morass of papers of hugely mixed quality, so how will the more interesting papers in such an archive get noticed as such? Philip Campbell

It is obviously challenging to use some of the metrics mentioned above as signals to rank the importance of individual articles when they are so easy to game. On the other hand some of them are already useful and working today. I already subscribe to RSS feeds from some users of Connotea who consistently bookmark articles that I find useful. Similarly, through FriendFeed I get recommendations of articles to read from people I trust. So, although I do not have a clear solution on how to build such a system, I think there is a need for it and there are clear ideas to try.
Here is something like a mind-map of what I think would work best, a mixture of the social recommendations of FriendFeed with the pure algorithmic ideas of Google News:


This idea of sorting based on measures of usage is already being tested by the new Frontiers journals. These are a series of open access journals published by an international not-for-profit foundation based in Switzerland. Like PLoS ONE, these journals aim to separate the peer-review process of quality and scientific soundness from the more subjective impact evaluation. In practice they are doing this by publishing research in a tiered system, with articles submitted to a set of specialty journals. The articles are evaluated based on the reading activity of the users and the top 10% advance up to the next tier journal.
So far Frontiers has started with neuroscience specialty journals with a single top tier journal (Frontiers in Neuroscience) but if this is successful they could easily add other disciplines and have a third tier of very general content on top. In order to contribute to the evaluation procedure, readers must fill out their profile. This information is taken into consideration, since the usage metrics of users will be weighted differently according to their expertise.
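
To make the tiered idea concrete, here is a minimal sketch in Python of how such a promotion step could work. It is only an illustration under my own assumptions and not a description of how Frontiers actually computes anything: the expertise categories, the weights and the function names are all hypothetical. Each read of an article counts more or less depending on the reader's declared expertise, and the top 10% of articles by weighted score are selected for the next tier.

```python
import math
from collections import defaultdict

# Hypothetical weights (an assumption, not from Frontiers): how much a single
# read counts, depending on the expertise declared in the reader's profile.
EXPERTISE_WEIGHT = {"expert": 3.0, "related_field": 2.0, "general": 1.0}


def weighted_scores(read_events):
    """Sum expertise-weighted reads per article.

    read_events is an iterable of (article_id, reader_expertise) pairs.
    Unknown expertise labels fall back to a weight of 1.0.
    """
    scores = defaultdict(float)
    for article_id, expertise in read_events:
        scores[article_id] += EXPERTISE_WEIGHT.get(expertise, 1.0)
    return dict(scores)


def promote_top_fraction(scores, fraction=0.10):
    """Return the ids of the top `fraction` of articles, best first.

    At least one article is promoted when there is anything to rank.
    Ties are broken by article id so that the result is deterministic.
    """
    if not scores:
        return []
    n_promoted = max(1, math.floor(len(scores) * fraction))
    ranked = sorted(scores, key=lambda article: (-scores[article], article))
    return ranked[:n_promoted]


if __name__ == "__main__":
    events = [
        ("a1", "expert"), ("a1", "general"),
        ("a2", "general"), ("a2", "general"), ("a2", "general"),
    ]
    scores = weighted_scores(events)
    print(scores)
    print(promote_top_fraction(scores))
```

In this toy example the article with fewer reads can still rank first because one of its readers is an expert, which is the kind of effect that weighting by profile is meant to produce. A raw count like this would be easy to game, so a real system would need something more robust.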

Summary
No single individual wants to go through all published literature to find the useful information but together we effectively do this. The challenge is how to evaluate specific articles by a combination of metrics to promote them to wider audiences in a way that is not easy to exploit. Kevin Kelly said recently in a TED talk that "The price of total personalization is total transparency". Would this bother scientists? Let's say that a few science publishers get together with some of these scientific social sites (social networks, bookmarking sites) to mimic the Frontiers model on a larger scale. Users would install a browser plugin that would link their scientific profile and social contacts with their reading activity. The publishers could then use this information to create personal reading hubs for users.

Saturday, August 09, 2008

BioBarCamp wrapup

Over the last two days I attended the first BioBarCamp, here in the Bay Area at the Institute for the Future. There is a lot of microblogging coverage of the event on FriendFeed and even some recorded video from Cameron Neylon (click on demand and pick BioBarCamp).

The meeting was fun because of the unstructured nature of the event and also because I got to meet a lot of people I knew only from blogs. Two highlights of the event were the talks by Aubrey de Grey (see notes and also Cameron's video above) and Jon Trowbridge from Google, who talked about this.

There were four parallel discussions going on but I mostly stayed with the talks related to open science and web tools. There are a couple of ideas that I take away from these discussions that I will mention below, but in general these overlap with what Shirley already mentioned in her post.

Pragmatic steps for Open Science and web tool adoption
Kaitlin Thaney and Cameron Neylon talked about open science and data commons. Cameron in particular is making the case that we need to demand open data the same way we demand open access to science articles. Although publishers will say that they already try to ask for the availability of everything required to reproduce the results, the truth is that this is not really well enforced. Funding agencies should provision funds to make raw results freely available for re-use once an article is accepted for publication.

On the side of web tools for science, Ricardo Vidal (OpenWetWare), Vivek Murthy (Epernicus), Jeremy England and Mark Kaganovich (Labmeeting) discussed user adoption. Adoption rates among scientists tend to be slow and there is a large generational gap. Here again, pragmatic steps need to be taken to promote the usage of these tools in science. Some of the current problems include fragmentation of the user base, lack of focus in tool development and too few security restrictions.

These tools should try to focus on solving a few important problems really well. Examples of these problems include finding the person in my network who might have some expertise that I need, better ways to find articles that I find relevant, or ways to manage my lab notebook and article library. To reduce the fragmentation of the user base it would be great if these websites found a way to share the social graph.

Finally, the question of privacy online was revisited once again. The idea of having open lab notebooks that anyone can see (as in OpenWetWare) might be a bit too radical and put off users who want to try the tools without the risks associated with exposing their research online. As has been discussed elsewhere, there are advantages in having electronic notebooks (easier to access, share with peers and back up) but very few people will risk having their lab notebooks freely available online. Therefore allowing for privacy should increase usage.

Sunday, July 27, 2008

Some backlash on Open Science

During ISMB, thanks to Shirley Wu (FF announcement), there was an improvised BoF (Birds of a Feather) session on web tools for scientists. Given that the meeting was not really announced we were not expecting a full room. I would say that we had around 20 to 30 people who stayed at least for a while. We talked in general about tools that are useful in science (things like online reference managers, pre-print archives, community wikis, FriendFeed, Second Life) and we also talked a bit about the culture of sharing and open science.

Curiously, the most interesting discussion I had about open science was not at this BoF session but after it. On the following day the subject came up again in a conversation between me and three other people (two PhD students and a PI from a different lab). I will not identify the people because I don't know if they would like that or not. The most striking thing for me about this conversation was the somewhat instinctive negative reaction against open science on the part of the two PhD students. After a long discussion they made a few interesting arguments that I will mention below, but what was strange for me was that this was the first time I had seen someone react instinctively in a negative way against the concepts of open science.

One of the students in particular was arguing that scientists sharing their results online (prior to peer review) is not only silly on their part (the scooping argument) but would also be detrimental to science as a whole. The most concrete argument he offered was that seeing someone "stake a claim" to a research problem might scare other people away from even trying to solve it. I would say that it would be better to have people collaborating on the same research problems instead of the current scenario, where a lot of scientists waste years (of their time and resources) working in parallel without even knowing about it. He argued simply that some people might not want to collaborate at all and should be allowed to work in this way. I don't think scientists should be forced to put their work online before peer review; I just happen to think that this would improve collaborations and decrease the current waste of resources.

The second argument against sharing research ideas and results prior to peer review was more widely shared. They all mentioned the problem of noise and how it is already difficult to find relevant results in the peer-reviewed literature. They suggested that this problem would get worse if more people were to share their ideas and results online. I fully agree that this is a problem, but it is not related at all to open science. This is a sorting/filtering problem that is already important today with the large increase in journals and published articles. We do need better recommendation and filtering tools, but sharing ideas and results in blogs/wikis/online project management tools is not going to seriously increase the noise since these are all very easily separated from peer-reviewed articles. No one is forced to track shared projects, but if they are available it would make it that much easier to start a collaboration when and if it makes sense to do so. Are open source repositories detrimental to the software industry?

It took around 3 years from when people started discussing the idea of open science and open notebooks for these concepts to get some attention. It is inevitable (and healthy) that as more people are exposed to a meme, more counter-arguments emerge. I guess that a backlash only means that the meme is spreading.