maximilianh's blog

Roundup: Extract a sequence from a fasta file

HMMs, SVMs, MCMC - interesting topics! But I have simple problems. Here is one: how do I extract some sequences from a fasta file if my accession numbers are not Genbank ids themselves but other words that still appear in the header?

Materials:
Hm. Pubmed won't export that many sequences from the web interface (at least I could not find a way; the limit is 100). If I were a biologist, I would probably repeat the process manually 70 times to get 70 * 100 sequences, which might actually have saved me a lot of time. But I wanted to be clever.
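
For the extraction question itself, here is a minimal sketch of one possible approach, not necessarily the one the full post arrives at. It assumes the sequences are already in a local fasta file and that the words of interest appear in the header as whole tokens separated by whitespace, "|", "," or ";". The function names read_fasta and extract_by_keywords are made up for this illustration.

```python
import re


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of fasta lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line.strip():
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def extract_by_keywords(lines, keywords):
    """Yield records whose header contains any keyword as a whole token.

    Assumption: header tokens are separated by whitespace, '|', ',' or ';'.
    """
    wanted = set(keywords)
    for header, seq in read_fasta(lines):
        tokens = set(re.split(r"[\s|,;]+", header))
        if wanted & tokens:
            yield header, seq
```

An open file handle can be passed directly as lines, so the whole file never has to sit in memory. Matching on whole tokens rather than substrings is a design choice: it avoids picking up "clone_A70" when you asked for "clone_A7".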


Are wetlabs career killers?

I've noticed that a couple of people here describe themselves as "computer guys" working for "biologists" (just like me). I wonder if there is anyone well-known like this: a computer scientist who started in a biological lab and was successful enough there to start his own group later on. Something like a role model for computer guys who do something that "is rather desperate". If you think that a wetlab is not a career killer for bioinformatics, then I would appreciate a concrete counter-example.


A pipeline is a makefile

What is a pipeline? For me, it's a series of steps that munch DNA/protein data, combine it with other data using various small scripts and output the results as diagrams or HTML. Do we want to code this kind of software as a script? If you think "makefile!" now, then you're much more clever than I was. Personally, until recently, I glued my scripts together using other scripts and used makefiles only for compiling my programs. That was a bad idea. (It's quite a detailed post, click on "read more" for the full article.)
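
To make the point concrete, here is a small sketch of the one rule that make applies and that glue scripts usually lack: a step is rerun only when its output is missing or older than one of its inputs. This is an illustration in Python rather than an actual makefile, and the names needs_rebuild and run_step are hypothetical, not taken from the full article.

```python
import os


def needs_rebuild(target, sources):
    """Return True if target is missing or older than any of its sources."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_time for src in sources)


def run_step(target, sources, action):
    """Run action(target, sources) only when the target is out of date.

    Returns True if the step was executed, False if it was skipped.
    """
    if needs_rebuild(target, sources):
        action(target, sources)
        return True
    return False
```

A pipeline then becomes a list of such steps, and rerunning it after changing one input file only repeats the steps that depend on that file. A real makefile gives you the same behaviour, plus dependency ordering, without writing any of this yourself.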


High-throughput pyro-sequencers for the price of a PCR machine in three years

I knew that sequencers are getting cheaper all the time, but in Genome Technology this week they're talking with the inventors of the technology that 454 is licensing, discussing future pyro-sequencing updates and how those should lead to very cheap machines. The prospect is that any lab can sequence its own genome in three years, and the technology seems almost ready: basically a cheaper and smaller version of 454's current machines. If you believe that sequence databases are exploding at the moment, better prepare for a new wave.


Real-world workshops: RegCreative

I've recently been to a cool workshop called "RegCreative". The idea was to mass-curate papers into a new database. There we had the usual discussion of "open" (Oreganno) versus "private" (Transfac) databases, and the open one in this case is still far from big enough, but that's not my main point here.
I liked the workshop because we actually spent a lot of time at the computer, reading papers. There were no big stars, impressive results, great publications or hypotheses, mainly people who presented their own databases ("I've spent 500 hours to create my database" (flytf), "I read 120 papers" (flyreg), etc.), and afterwards everyone would get back to their computers, trying to enter one of the papers from the big pile at the entrance. The problem of database curation became very obvious to all participants, and they got more tired of reading papers with every day that passed... (Here is a picture from the beginning, when people were still discussing :-)


10,000 times faster BLAST, in JAVA???

Given that the backbone of sequence analysis, and as such of bioinformatics, is alignment, this news has the potential to shake some ground in the community: there is a new company claiming that their new BLAST is as sensitive as the original but 10,000 times faster, thanks to an improved seed searching phase. I'm very sceptical... local alignments have been worked on for decades without much speed improvement at the same accuracy. Which is why there are special hardware solutions for this problem (I wonder who is buying them).
But the company's website does not look like a joke and they're shipping demo versions... I have no clue how that could work. Suffix arrays? But that has already been applied to local alignment seed search, right? Anyone out there with rumors about this company?


The value of bad code

Computer scientists often have a hard time getting used to bioinformatics people. They are trained to write readable, documented, non-spaghetti code. Over years of training they were told about UML diagrams, and some even had special courses on the once-hyped "software engineering". We're taught all the time that this is the most important part of computer science and that poorly written software causes losses amounting to billions every year.

In science, everything is different, of course. Documentation does not matter.

