Keeping track of what was done.

Hey, very often we run a bunch of analyses on hundreds of sequences, and then, a month later, somebody asks "How did you get all these results?" and I freeze. Why? Well, most of the time I work under pressure and with a lot of data, and I can't keep a log of it. But how do you keep track of what you did? How do you repeat an analysis to show what was done? Which version of the software was it? How many sequences did we have at the beginning? And so on...

How do you keep track of what you did?

That is my question.


Comments

"script"

Maybe useful, maybe not.

When doing some important-ish sysadmin stuff, I use "script", a program that copies all your keyboard input and all the resulting output into a file.

This may be overkill - you get to see which commands you got right and which you got wrong (for example) - but it's fairly easy to use, and it does keep a record of everything you've done.


Key logger.

I don


On Debian, part of "bsdutils"

It's part of "bsdutils" on Debian, so I guess the source would be available from ftp://ftp.debian.org/debian/pool/main/u/util-linux/util-linux_2.11n.orig...


how to be organised

This is a really good question. I have very similar problems, performing new sequence assemblies and analyses each time we receive more genome sequences, and I have to admit I am not as organised as I would like to be. It's a particular problem when the boss demands quick results; recording what you do is often an afterthought.
I suppose my simple answer is that I make use of the UNIX directory system a lot. So each new assembly gets its own directory, with a standard sub-directory tree structure containing various analyses and a separate directory of scripts which can be applied to batches of project files and so on. So in some ways, the directory organisation can help me remember what was done and in what order. But I guess what we're talking about here is project management software. I'd like to hear about free, open-source software that can help in this area.
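
The "one directory per assembly, same sub-directory tree every time" habit could be scripted along these lines. This is only a sketch: the sub-directory names and the function name `make_project_tree` are made up for illustration, so swap in whatever layout you actually use.

```python
from pathlib import Path

# Hypothetical sub-directory names; replace with your own standard layout.
SUBDIRS = ["input", "assembly", "blast", "annotation", "notes"]


def make_project_tree(root, name, subdirs=SUBDIRS):
    """Create root/name with the standard sub-directories and return its path."""
    project = Path(root) / name
    project.mkdir(parents=True, exist_ok=True)
    for sub in subdirs:
        (project / sub).mkdir(exist_ok=True)
    # A README at the top level is a place to note what was done, in order.
    readme = project / "README.txt"
    if not readme.exists():
        readme.write_text(f"Project: {name}\n")
    return project
```

Calling `make_project_tree("/data/projects", "assembly_01")` twice is harmless: existing directories and an existing README are left alone.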


Solutions?

Agreed. I find myself doing things that look great, and then can't remember how I did them.....

The file structure is a part of the solution, and I tend to use a system like that, too. However, it's often not enough. What is required is documentation of what has been done.....

A private blog, perhaps? If you run it locally, you can make "notes to self", and then upload it into a lab intranet occasionally - after editing for expletives :) What I'm getting at is a near-line blog, or a second, on-line one, devoted to work (read: not within most people's reach). Yes, I know that the point of blogs is to be accessible, but not if you just want to jot a few notes down.

I've been playing around with a number of post-it programs, which put sticky notes on your desktop, but these aren't really up to the task. I guess some reasonable content management is required. Still haven't decided what, though....


What should I write on the log?

In fact, the directory structure is the first level of organization, along with some quick notes. I feel that this problem is more common than I thought. I've played around with some tools like DCL and similar project management software, but I was never satisfied; often they don't fit a bioinformatics analysis. I think that any solution for logging what was done should be integrated into the analysis too. Like the directories, it should be part of it.

Another problem is the hundreds of repetitive scripts I've written. One tool that has helped a lot is CVS, along with keeping all scripts in one directory and never copying them to the project directory, because copying often leads to the question, "What was the last version of it?"

I worked for about two years in a molecular biology lab, and there they already know what information to write in the log book; every new student learns from an older student what to log. But bioinformatics is new, and it both processes and generates massive amounts of data. So, what should we track?

One thing that I'm trying to do in my lab is to write down all the information an entity (your input, a program, or anything else) needs in order to be used again if necessary. E.g., for BLAST we often need to keep track of the version, the parameters used, how many sequences were in the input and which database was used. Moreover, the database has a source (was it made here or downloaded from somewhere else?), a date of last update, and so on... What I think is that if we track some key information about each of the "entities", we can reproduce the analysis or say "I did it this way..", and then we can produce some guidelines like "Don't forget to log this information about your similarity search before moving on." If anyone shares the same feeling, we can try to gather all our experience with analyses and build some general guidelines.
Any thoughts about it?
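
To make the "key information per entity" idea a bit more concrete, here is a rough sketch of what one log entry for a similarity search might hold. The field names, the `SearchRecord` class and the one-JSON-object-per-line log format are all my own assumptions, not an agreed guideline; the fields just mirror the BLAST items listed above.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class SearchRecord:
    """Key facts about one similarity search (hypothetical field names)."""
    program: str            # e.g. the BLAST flavour that was run
    version: str            # program version as reported by the tool
    parameters: str         # the command-line parameters used
    n_input_sequences: int  # how many sequences were in the input
    database: str           # which database was searched
    database_source: str    # made locally, or downloaded from where?
    database_updated: str   # date of the database's last update


def count_fasta_sequences(path):
    """Count sequences in a FASTA file by counting '>' header lines."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def append_record(log_path, record):
    """Append the record as one JSON line, stamped with the current UTC time."""
    entry = asdict(record)
    entry["logged_at"] = datetime.now(timezone.utc).isoformat()
    with open(log_path, "a") as log:
        log.write(json.dumps(entry) + "\n")
    return entry
```

You would fill in a `SearchRecord` right after the search finishes (using `count_fasta_sequences` on the input file) and append it to a log kept in the project directory, so the log travels with the results.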

:wq


Wrappers and pipelines

Yep, software wrappers are the way to go. Combine these with a pipeline system and I think you have the answer to your question. Unfortunately, AFAIK, implementations for biology applications are few.

The bioperl-pipeline project does look promising. It seems they are using XML definitions of pipelines and program parameters that are already wrapped by bioperl. I'm no big fan of perl but bioperl is quite solid cf. the other bioprojects.

--
www.tyrelle.net


Bioperl Pipeline

I too believe that bioperl-pipeline is promising, but like Bioperl the project lacks good documentation and is not straightforward even for Perl programmers like me.
Just for the record, I use bioperl at work, but they still rely on JavaDoc-like documentation, and I find this kind more complex to use than hand-made documentation.

:wq


Documentation: always the issue...

As I understand it (i.e. not that well) the pipeline project is just in the "beginning" stages, so I would not expect good documentation anytime soon. However, one of the best ways to understand a system is to document it. In most cases developers hate writing documentation (I don't mind writing actually; more often than not it's a time issue), so I would suggest contacting the developers (the Fugu genome guys are doing the pipeline project IIRC) and offering to write a beginner's guide or how-to for their system. The offer of help will generally mean they will reciprocate and help you understand how their system works.

Developers are *very* helpful to people who are willing to write documentation. As a bonus you'll also get kudos and maybe fame, fortune and glory too ;]

By "hand made" documentation I gather you are referring to tutorials/user manuals/how-tos etc. Bioperl devs are in the process of putting together a series of how-tos.

_greg

--
.:www.tyrelle.net:.


BioPerl documentation

There's been some discussion of BioPerl documentation on their mailing lists lately. Everyone seems agreed it's an area that needs some work. A basic problem seems to be that the code is updated more often than the documentation, so they get out of sync.
Personally I quite like the web-based BioPerl docs, but I think they'd benefit from more example code, in the style of UNIX manpages. Often in the tutorial they only focus on a small number of objects and methods for each module. If you want to know more, you just have to read either module documentation in depth as the need arises (my approach) or spend time digesting the inner workings of the whole package (Greg's approach).


automatic logging

At one place I worked, all the analysis tools were replaced by wrapper scripts that would log the tool use. These logs could be used later to focus on improving the tools that got used most. I imagine the same trick could be used to keep a history of what was done.

If the logs contained the working directory, finding what happened on a given project would just be a matter of grepping through the log. Used with CVS, the system could tell you what version of the tool was used. The wrapper for blast, say, could be extended to save extra tool specific information like the database date or number of sequences processed.
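
A wrapper of that kind might look roughly like the sketch below. It is not the script from that workplace: the log location, the `TOOL_LOG` environment variable, the tab-separated line format and the function names are all assumptions made for illustration.

```python
import os
import shlex
import subprocess
from datetime import datetime, timezone

# Hypothetical log location; override it with the TOOL_LOG environment variable.
LOG_FILE = os.environ.get("TOOL_LOG", os.path.expanduser("~/tool_usage.log"))


def format_entry(argv, cwd, user, exit_code, when):
    """Build one tab-separated log line: time, user, directory, status, command."""
    return "\t".join([when, user, cwd, str(exit_code), shlex.join(argv)])


def run_logged(argv, log_file=LOG_FILE):
    """Run the real tool, then append a line describing the run to the log."""
    result = subprocess.run(argv)
    entry = format_entry(
        argv,
        os.getcwd(),
        os.environ.get("USER", "unknown"),
        result.returncode,
        datetime.now(timezone.utc).isoformat(),
    )
    with open(log_file, "a") as log:
        log.write(entry + "\n")
    return result.returncode
```

Because the working directory is its own column, `grep /path/to/project ~/tool_usage.log` gives the history of a project, as suggested above. A tool-specific wrapper could add more columns, such as the database date or the number of input sequences.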