Obtaining overrepresented motifs in DNA sequences, part 13

Phase 2, motifs No Comments »

Now that we have the best quorum determination function and the ideal function to calculate the binomial expansions it is easy to program a script to calculate the p value of motifs in DNA sequences. To the script

below in the code there are a couple of errors that wordpress don’t let me fix. The > and < are replaced by their literal html enconding. I am working on it, sorry

#!/usr/bin/env python

import fasta
import sys
from collections import defaultdict

def choose(n, k):
    if 0 &lt;= k &lt;= n:
        ntok = 1
        ktok = 1
        for t in xrange(1, min(k, n - k) + 1):
            ntok *= n
            ktok *= t
            n -= 1
        #print ntok // ktok
        return ntok // ktok
    else:
        return 0

def get_quorums(seqs, mlen):
    """
    add seq id_no to a set
    use explicit counter to create seq_no
    """
    quorum = defaultdict(set)
    id_no = 0
    for seq in seqs:
        id_no += 1
        for n in range(len(seq) - mlen):
            quorum[seq[n:n + mlen]].add(id_no)
    return quorum

input_seqs = fasta.read_seqs(open(sys.argv[1]).readlines())
input_seqs2 = fasta.read_seqs(open(sys.argv[2]).readlines())

foreground = get_quorums(input_seqs, 10)
background = get_quorums(input_seqs2, 10)

N = len(input_seqs) + len(input_seqs2)

for i in foreground:
    term1 = choose(len(background[i]), len(foreground[i]))
    term2 = choose((N - len(background[i])), len(input_seqs)-1)
    term3 = choose(N, len(input_seqs))
    p = (float(term1) * float(term2)) / term3
    if 0 &lt; p &lt;= 0.0001:
        print i, len(foreground[i]), len(background[i]), p

We already defined choose in the last post (more information in the link from the Python’s cookbook) and earlier Mike sent us a series of quorum-determination functions and one of the best was portrayed and explained here. We also need our fasta module to read the sequences (and only the sequences) in order to use it in the quorum function.

Basically we use the foreground and background files as input, determine the quorum of the different words (width 10) and then we iterate over the results, calculating the p value for each motif found in the foreground set. The tree terms of the Hypergeometric Distribution are calculated separately and we test for a p value smaller that 0.0001 (this can be modified) and we only print the results that fall in this category. Next time we will do some code refactoring and change a little bit this script and also explain it better.

Reblog this post [with Zemanta]

Obtaining overrepresented motifs in DNA sequences, part 12.5

Phase 2 2 Comments »

So let’s modify a little bit the factorial function and then benchmark both by using timeit. Ideally our factorial function would need to calculate a value similar to the binomial expansion, as we have three factorials to calculate in for each binomial in the Hypergeometric Distribution.

So we can add two extra factorial calculations to our function and perform the multiplication and division to return the equivalent to the binomial calculation. So the function would be

def fac(n, m):
    value1 = 1
    for i in xrange(2, n + 1):
        value1 *= i
    value2 = 1
    for i in xrange(2, m + 1):
        value2 *= i
    value3 = 1
    for i in xrange(2, (n - m) + 1):
        value3 *= i 

    return  value1 / (value2 * value3)

m and n are both values of the binomial and n - m is the subtraction of one by the other that forms the last factorial to be calculated. This way it makes easier to time the performance of both functions. In the end the complete script would look like

#!/usr/bin/env python

import timeit

def fac(n, m):
    result1 = 1
    for i in xrange(2, n + 1):
        result1 *= i
    result2 = 1
    for i in xrange(2, m + 1):
        result2 *= i
    result3 = 1
    for i in xrange(2, (n - m) + 1):
        result3 *= i 

    return  result1 / (result2 * result3) 

def binom(n, m):
    b = [0] * (n + 1)
    b[0] = 1
    for i in xrange(1, n + 1):
        b[i] = 1
        j = i - 1
        while j &gt; 0:
            b[j] += b[j - 1]
            j -= 1
    return b[m] 

def choose(n, k):
    if 0 <= k <= n:
        ntok = 1
        ktok = 1
        for t in xrange(1, min(k, n - k) + 1):
            ntok *= n
            ktok *= t
            n -= 1
        #print ntok // ktok
        return ntok // ktok
    else:
        return 0

if __name__ == "__main__":

    stmt = "fac(3000, 7)"
    t = timeit.Timer(stmt = stmt, setup='from __main__ import fac')
    stmt2 = "binom(3000, 7)"
    t2 = timeit.Timer(stmt = stmt2, setup = 'from __main__ import binom')
    stmt3 = "choose(3000, 7)"
    t3 = timeit.Timer(stmt = stmt3, setup = 'from __main__ import choose')

    print 'fac: %.9f' % (t.timeit(100)/100)
    print 'binom: %.2f' % (t2.timeit(10)/10)
    print 'choose %.9f' % (t3.timeit(100)/100)

The final result of the average for ten repetitions is as follow

fac = 0.10 s
binom = 43.24 s
choose = 0.000005 s

Clearly, the factorial function gets a huge advantage over the binomial one. So we will modify it a little bit and use it for our HD script. Clearly the choose function is the fastest one, so we will incorporate it on out HD script.

Reblog this post [with Zemanta]

Obtaining overrepresented motifs in DNA sequences, part 11

Phase 2 6 Comments »

After a long hiatus we are (almost) back on track in order to get our scripts to determine overrepresented motifs in DNA sequences. Last time we checked we defined the “best” factorial function in Python

def fac_01(n):
    result = 1
    for i in xrange(2, n+1):
        result *= i
    return result

and Andrew Dalke sent a couple of links pointing out to a binomial calculation function, one of them is below

# This file contains the Python code from Program 14.10 of
# "Data Structures and Algorithms
# with Object-Oriented Design Patterns in Python"
# by Bruno R. Preiss.
#
# Copyright (c) 2003 by Bruno R. Preiss, P.Eng. All rights reserved.
#
# http://www.brpreiss.com/books/opus7/programs/pgm14_10.txt
#
def binom(n, m):
    b = [0] * (n + 1)
    b[0] = 1
    for i in xrange(1, n + 1):
        b[i] = 1
        j = i - 1
        while j > 0:
            b[j] += b[j - 1]
            j -= 1
    return b[m]

There is a similar implementation in the Python Cookbook online and it is clearly more convenient using this function than actually coding to calculate an identical value using factorials. But anyway, let’s see how a function that calculates one binomial coefficient from the Hypergeometric Distribution (HD) by using the factorial function. Later we benchmark both methods.

Each binomial coefficient can be expanded as

and the HD has three of them. From the Wikipedia “In probability theory and statistics, the hypergeometric distribution is a discrete probability distribution that describes the number of successes in a sequence of n draws from a finite population without replacement.” In the next post we will define each term of the HD for the motifs case.

In each binomial expansion we would have to calculate three factorial values, nine in total. With the binomial function, only three values need to be calculated. So, using the factorial function we would need to code something like this in order to calculate one of the binomials


#let's say the motif quorum in the foreground is 7
#and the total number of sequences is 3000
#we won't touch the other required values
fore = 7
total = 3000
hd = fac_01(3000) / (fac_01(7) * fac_01(2993)
print hd

Next, we will benchmark and see if there is an advantage to either method.

Reblog this post [with Zemanta]

Test from Zoundry Raven

off topic No Comments »

I am testing a offline/desktop bloggin tool, called Zoundry Raven. New posts are on the way, as promised.

Python for Bioinformatics: a room on FriendFeed

Phase 2 No Comments »

FriendFeed's homepageImage via WikipediaA couple of days ago a Python for Bioinformatics room was created on FriendFeed. From all the Bioinformatics-related computer languages groups is the one with the highest number of subscribers. Anyone is welcomed to join and contribute to the room. The room has limited resources so far, but we are hoping it grows a lot in the next few days.

Come join us.

Zemanta Pixie

BPB on OWW

Phase 2 No Comments »

OpenWetWare logo, designed by Jennifer Cook-Ch...Image via WikipediaThe contents of this site/blog are being (slowly) transferred to a wiki-like page on Open Wet Ware. Thanks to Ricardo Vidal that started including the articles and was the person responsible for this initiative (and a great artist, illustrator).

What is the OWW?

OpenWetWare is an effort to promote the sharing of information, know-how, and wisdom among researchers and groups who are working in biology & biological engineering. Learn more about us.
If you would like edit access, would be interested in helping out, or want your lab website hosted on OpenWetWare, please join us.

Zemanta Pixie

New posts

off topic 3 Comments »

CPythonImage via WikipediaPosts will resume next week. I was off on vacation and then I changed jobs, so there was no time for Python fun.

Zemanta Pixie

Bioinformatics career survey

Uncategorized No Comments »

Via Bioinformatics Zen:

Zemanta Pixie

A quick assessment of factorial functions in Python

Phase 2 10 Comments »

A short pause on the motifs subject. As mentioned on comments there is a lot of different ways of calculating factorials in Python (the same “problem” can be found in some other languages too). Cariaso suggested to time the execution of different factorial functions, including the ones found on Python’s cookbook (which I should have included in the beginning of last post). Anyway all functions from the webpage were included, as the one mentioned on a comment and both functions seen here. Using timeit (thanks Mike!) the execution time of all of them were measured by calculating the factorial of 800 and 4000. First, the functions:

def fac_01(n):
    result = 1
    for i in xrange(2, n+1):
        result *= i
    return result

def fac_02(n):
    value = reduce(lambda i, j : i * j, range(1, n + 1))
    return value

def fac_03(n):
    import operator
    value = reduce(operator.mul, xrange(2, n + 1))
    return value

def fac_04(n):
    fac = lambda n:n-1 + abs(n-1) and fac(n-1)*long(n) or 1
    return fac(n)

def fac_05(n):
    fac = lambda n:[1,0][n&gt;0] or fac(n-1)*n
    return fac(n)

def fac_06(n):
    fac = lambda n:reduce(lambda a,b:a*(b+1),range(n),1)
    return fac(n)

def fac_07(n):
    fac=lambda n: [1, 0][n &gt; 0] or reduce(lambda x, y: x*y, xrange(1,n + 1))
    return fac(n)

def fac_08(n):
    fac = lambda n: n &lt;= 0 or reduce(lambda a, b: a*b, xrange(1,n + 1))
    return fac(n)

def fac_09(n):
    fac = lambda n: [[[j for j in (j * i,)][0] for i in range(2, n+1)][-1] for j in (1,)][0]
    return fac(n)

def fac_10(n):
    fac = lambda n: [j for j in [1] for i in range(2, n+1) for j in [j * i]] [-1]
    return fac(n)

fac_01 was suggested by bearophile on a comment (no psyco import here), fac_02 was the first one we have seen in the blog, fac_03 was the one “selected as the fastest” (mentioned on comments too) and functions 4 to 10 were gathered from the cookbook, all of them using lambda. Now, to the results (no fancy graphs, yet), average over 1000 repeats, time in seconds:

800!
fac_01: 0.0010
fac_02: 0.0012
fac_03: 0.0010
fac_04: 0.0022
fac_05: 0.0020
fac_06: 0.0015
fac_07: 0.0013
fac_08: 0.0013
fac_09: 0.0017
fac_10: 0.0015

4000!
fac_01: 0.0226
fac_02: 0.0242
fac_03: 0.0230
fac_04: N/A
fac_05: N/A
fac_06: 0.0244
fac_07: 0.0239
fac_08: 0.0241
fac_09: 0.0380
fac_10: 0.0362

In both cases the “best” functions were 1 and 3. For the smallest factorial (800!) 1 and 3 didn’t have a lot of advantage for 2, 6-10, while 4 and 5 were 2 times slower. For the largest factorial (4000!) 1 was 4% better than 3, and the third best (6) was also 4% slower than 3. Why do 4 and 5 have no computed time? Because they are recursive and it blows the recursion limit in Python.

So I guess it is a safe bet to use either function 1 or 3, with a slightly advantage for 3 in the case of large factorials. But at the same time a believe for our case of overrepresented motifs, it would be better to pre-calculate the factorials, store them and access the value when needed. Andrew’s idea of using binomials is also interesting, and I am planning to test it here. He also gave a couple of other suggestions that would need extra imports, scipy and gmpy.

Zemanta Pixie

Obtaining overrepresented motifs in DNA sequences, part 10

Phase 2, motifs 6 Comments »

Let’s get back to the statistical module, that will calculate an Hypergeometric Distribution (HD) p value so we can define the overrepresented motifs. Last time we saw it, we just had defined the factorial function, which is immensely helpful in this case due to the number of factorial calculations needed in the HD. The factorial function was the one below


def fac(n):
    value = reduce(lambda i, j : i * j, range(1, n + 1))
    return value

but as mentioned in the comments by Dave and by Mike via email the method used is not the best method to calculate factorial in Python. The best approach in this case is to use operator.mul. All functions in the operator modules are in implemented in pure C and they mimic the same operators in Python. So in this module we can find mul for multiplication, sub for subtraction, add for additions, etc.

The operator.mul needs two arguments to multiply, and in our case we still need to use reduce to sum all the results from a series of multiplications. As parameters we should use a range, that can start with 2, that should go up to the number we want the factorial plus one. Finally our function would be


import operator

def fac(n):
    value = reduce(operator.mul, xrange(2, n+1))
    return value

The time gain, quickly measured in a non-scientific fashion in my system, is around 5 to 15%, depending on the factorial being calculated. It may seem a small gain, but when you need to calculate almost a million factorials for all possible motifs the amount of time saved is crucial. Next time we will be back with more statistics, expanding the module.

Zemanta Pixie
Design by j david macor.com.Original WP Theme & Icons by N.Design Studio
Entries RSS Comments RSS Log in