Analyzing my Writings with Python

*I spent the day* in the lab with some spare time on my hands between adding reagents to an ongoing immunohistochemical reaction I was performing. At one point I decided to further investigate the field of bioinformatics (is it worth seeking a PhD in this field if I don’t get into dental school again?). UCF offers a PhD in bioinformatics, but it’s a new and small department (I think there are only 4 faculty). The degree itself is a degree in computer science (the logic side of computers, more programming than designing hardware). A degree in bioinformatics combines molecular biology (DNA, proteins, etc.), computer science (programming), and statistics (developing code to analyze biological data). I feel the need to explain what it is, because it’s not something that is commonly understood. Do you know what people who study bioinformatics do?

*I came across a paper* today, “Structural Alignment of Pseudoknotted RNA”:http://cseweb.ucsd.edu/users/shzhang/app/RECOMB2005_pseudoknot.pdf (by Han B, Dost B, Bafna V, and Zhang S.), which is a good example of bioinformatics in practice. Think about what goes on in a cell… the sequence of a gene (a short region of DNA) is copied (letter-by-letter) onto an RNA molecule. The RNA molecule is later read by a ribosome and converted into a protein based on its sequence. (This process is the central dogma of molecular biology.) Traditionally, it was believed that RNA molecules’ only function was to carry gene sequences from DNA to ribosomes, but in the last several years it was discovered that some small RNA molecules are never translated into proteins, instead serving their own unique functions! For example, some RNA molecules (siRNAs) can actually turn genes on and off, and have been associated with cancer development and other immune diseases. Given the human genome (the ~3-billion-letter sequence of all of our DNA), how can we determine which regions form these functional RNA molecules that don’t get converted into proteins? The paper I mentioned addresses this: an algorithm was developed to scan regions of DNA and predict the probability that each forms a small RNA molecule. Spikes in the resulting trace (figure 7 of the paper) represent areas of DNA likely to form these RNA molecules. (Is this useful? What if you compared these results between a normal person and someone with cancer?)
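The paper’s actual algorithm (structural alignment against known pseudoknot structures) is far more sophisticated, but the general pattern – score every position along a sequence and look for spikes – can be illustrated with a toy stand-in metric. Here the GC content of a sliding window plays the role of the score (my own invention, purely for illustration, not the paper’s method):

```python
# Toy illustration: slide a window along a DNA sequence and score each
# position; spikes in the resulting trace mark candidate regions,
# analogous in spirit to figure 7 of the paper.

def window_scores(seq, size=5):
    """Score each window by its G+C fraction (a stand-in metric)."""
    scores = []
    for i in range(len(seq) - size + 1):
        window = seq[i:i + size]
        scores.append(sum(1 for base in window if base in "GC") / size)
    return scores

seq = "ATATATGCGCGCGCATATAT"
trace = window_scores(seq)
# the GC-rich middle of the sequence produces the highest scores
print(max(trace), trace.index(max(trace)))  # -> 1.0 6
```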

*After reading the article* I thought to myself, “Hmmm… logically manipulating large amounts of linear data… why does this seem familiar?” Then I realized how similar my current programming projects are to this one. (See “my latest DIY ECG data”:http://www.swharden.com/blog/images/ecg_goodie.png posted a couple days ago.) Consider the trace (pictured, figure 7 in “Structural Alignment of Pseudoknotted RNA”:http://cseweb.ucsd.edu/users/shzhang/app/RECOMB2005_pseudoknot.pdf) of score (the likelihood that a region of DNA forms an RNA molecule), where peaks represent likely locations of RNA formation. Just generate the trace, determine the positions of the peaks, and you’re golden. How similar is this to the work I’ve been doing with my homemade ECG machine, where I filter out electrical noise and then analyze the resulting trace to isolate and identify peaks corresponding to heartbeats?
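The peak-finding step is essentially the same in both cases. Here’s a minimal sketch of that kind of detector (my own simplified illustration, not the paper’s method and simpler than my ECG code): a point counts as a peak if it clears a threshold and beats its neighbors.

```python
# Minimal 1-D peak detector: a sample is a peak if it exceeds a
# threshold, is larger than the previous sample, and is at least as
# large as the next one.

def find_peaks(trace, threshold=0.5):
    peaks = []
    for i in range(1, len(trace) - 1):
        if (trace[i] > threshold
                and trace[i] > trace[i - 1]
                and trace[i] >= trace[i + 1]):
            peaks.append(i)
    return peaks

signal = [0.1, 0.2, 0.9, 0.3, 0.1, 0.8, 0.2, 0.1]
print(find_peaks(signal))  # -> [2, 5]
```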

*After reading* I shivered from mental overload. There are so many exciting Python projects in the field of bioinformatics just waiting for me to start on! I know I’m like a child sometimes, but hey, it’s my personality. I get excited. It’s just that I get excited about tacky things these days. Anyway, I got the itch to write a string-analysis program. What does it do? It reads the content of my website (exported in the form of an SQL backup generated by phpMyAdmin, pictured), splits it up by date, and allows for its analysis. Ultimately I want to track the usage of certain words (e.g., the inverse relationship between the words “girls” and “python”), but for now I wrote a script which plots the number of words I wrote per post over time. Observe the output.

*Pretty cool huh?* Check out all those spikes between 2004 and 2005! (previous figure) Not only are they numerous (meaning many posts), but they’re also high (meaning many words per post). As you can see by the top trace, the most significant contribution to my site occurred during this time. So, let’s zoom in on it! (next figure)

*And of course, the code to produce this…* (obviously you have to have a wordpress backup SQL file in the same folder – if you want mine let me know and I’ll email it to ya’)

import datetime, pylab, numpy
# Let's convert SQL-backups of my WordPress blog into charts! yay!
class blogChrono():
    baseUrl="http://www.SWHarden.com/blog"
    posts=[]
    dates=[]
    def __init__(self,fname):
        self.fname=fname
        self.load()
    def load(self):
        print "loading [%s]..."%self.fname,
        f=open(self.fname)
        raw=f.readlines()
        f.close()
        for line in raw:
            if ("INSERT INTO" in line
                    and ';' in line[-2:-1]
                    and " 'post'," in line[-20:-1]):
                post={}
                line=line.split("VALUES(",1)[1][:-3]
                line=line.replace(', NULL',', None')
                line=line.replace(", '',",", None,")
                line=line.replace("''","")
                c= line.split(',',4)[4][::-1]
                c= c.split(" ,",21)
                text=c[-1]
                text=text[::-1]
                text=text[2:-1]
                text=text.replace('"""','###')
                line=line.replace(text,'blogtext')
                line=line.replace(', ,',', None,')
                line=eval("["+line+"]")
                if len(line[4])>len('blogtext'):
                    x=str(line[4].split(', '))[2:-2]
                    raw=str(line)
                    raw=raw.replace(line[4],x)
                    line=eval(raw)
                post["id"]=int(line[0])
                post["date"]=datetime.datetime.strptime(line[2],
                                                        "%Y-%m-%d %H:%M:%S")
                post["text"]=eval('"""'+text+' """')
                post["title"]=line[5]
                post["url"]=line[21]
                post["comm"]=int(line[25])
                post["words"]=post["text"].count(" ")
                self.dates.append(post["date"])
                self.posts.append(post)
        self.dates.sort()
        d=self.dates[:]
        i,newposts=0,[]
        while len(self.posts)>0:
            die=min(self.dates)
            for post in self.posts:
                if post["date"]==die:
                    self.dates.remove(die)
                    newposts.append(post)
                    self.posts.remove(post)
        self.posts,self.dates=newposts,d
        print "read %d posts!\n"%len(self.posts)

#d=blogChrono('sml.sql')
d=blogChrono('test.sql')

fig=pylab.figure(figsize=(7,5))
dates,lengths,words,ltot,wtot=[],[],[],[0],[0]
for post in d.posts:
    dates.append(post["date"])
    lengths.append(len(post["text"]))
    ltot.append(ltot[-1]+lengths[-1])
    words.append(post["words"])
    wtot.append(wtot[-1]+words[-1])
ltot,wtot=ltot[1:],wtot[1:]

pylab.subplot(211)
#pylab.plot(dates,numpy.array(ltot)/(10.0**6),label="letters")
pylab.plot(dates,numpy.array(wtot)/(10.0**3),label="words")
pylab.ylabel("Thousand")
pylab.title("Total Blogged Words")
pylab.grid(alpha=.2)
#pylab.legend()
fig.autofmt_xdate()
pylab.subplot(212,sharex=pylab.subplot(211))
pylab.bar(dates,numpy.array(words)/(10.0**3))
pylab.title("Words Per Entry")
pylab.ylabel("Thousand")
pylab.xlabel("Date")
pylab.grid(alpha=.2)
#pylab.axis([min(d.dates),max(d.dates),None,20])
fig.autofmt_xdate()
pylab.subplots_adjust(left=.1,bottom=.13,right=.98,top=.92,hspace=.25)
width=675
pylab.savefig('out.png',dpi=width/7)
pylab.show()

print "DONE"

*I wrote a Python script to analyze the word frequency* of the posts on my website (extracted from a WordPress SQL backup file). “This is what I came up with”:http://swharden.com/little/worddump.html I then took my giant list over to “Wordie”:http://www.wordle.net/create and had it create a super-cool little word jumble. Neat, huh? Here’s “a picture”:http://www.SWHarden.com/blog/images/wordie2.png that’s cool but not worth posting.

*This is the script to make the worddump:*

import datetime, pylab, numpy
f=open('dump.txt')
body=f.read()
f.close()
body=body.lower()
body=body.split(" ")
tot=float(len(body))
words={}
for word in body:
    for i in word:
        if 65<=ord(i)<=90 or 97<=ord(i)<=122: pass
        else: word=None
    if word:
        if not word in words:words[word]=0
        words[word]=words[word]+1
data=[]
for word in words: data.append([words[word],word])
data.sort()
data.reverse()
out = "<b>Out of %d words...</b>\n"%tot
xs=[]
for i in range(1000):
    d=data[i]
    out += '<b>"%s"</b> ranks #%d used <b>%d</b> times (%.05f%%)\n'%\
           (d[1],i+1,d[0],d[0]/tot*100)
f=open("dump.html",'w')
f.write(out)
f.close()
print "DONE"
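For what it’s worth, the tallying part of that script can be done much more concisely with `collections.Counter`, which handles both the counting and the ranking. A sketch (written in modern Python 3, assuming the same whitespace-split, alphabetic-words-only rules as my script above):

```python
from collections import Counter

def word_counts(text):
    # keep only purely alphabetic words, case-insensitively
    words = (w for w in text.lower().split() if w.isalpha())
    return Counter(words)

counts = word_counts("Python is fun and Python is fast")
print(counts.most_common(2))  # -> [('python', 2), ('is', 2)]
```

`Counter.most_common(n)` returns the `n` highest counts in descending order, which replaces the manual build-a-list, sort, and reverse steps.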