Munch Lab

Base composition of HIV sequences

In this exercise you will work on an HIV-1 sequence. We want to compute the frequency of the four nucleotides in the gene. First download the module exerciseWeek3.py, if you have not done so already, and put it in the folder where you keep your other Python code. Importing it like this:

import exerciseWeek3

will allow you to access the HIV sequence as the string exerciseWeek3.hivSeq.

Counting bases

Write a function, countBases(seq), that, given a DNA string seq, returns a dictionary that maps each base to the number of occurrences of that base in seq.

Example usage:

print(countBases("ACTGGCCCT"))

{'A': 1, 'C': 4, 'T': 2, 'G': 2}

Then try it out on your HIV sequence and see what you get.
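
If you get stuck, here is a minimal sketch of one possible approach, not the official solution. It assumes that every character in seq should be counted as a base, and it relies on the dictionary method get() to supply a count of zero the first time a base is seen.

```python
def countBases(seq):
    """Return a dictionary mapping each base in seq to its count."""
    counts = {}
    for base in seq:
        # get() returns 0 if base is not yet a key in counts
        counts[base] = counts.get(base, 0) + 1
    return counts

print(countBases("ACTGGCCCT"))
```

Bases that do not occur in seq are simply absent from the dictionary in this version, so a lookup such as counts['G'] fails for a sequence without any G.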

Computing frequencies

Write a function, printBaseComposition(seq), that, given a DNA string seq, prints a nice table with the proportion of each base. Call countBases(seq) from within printBaseComposition(seq) to count the bases.

Example usage:

printBaseComposition("ACTG")

A: 0.25
C: 0.25
T: 0.25
G: 0.25
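
As a hedged illustration, the sketch below shows one way the two functions could fit together. The countBases() shown here is only a stand-in so that the example runs on its own; use your own version instead. Dividing by float(total) makes sure we get a proportion rather than a whole number.

```python
def countBases(seq):
    # Stand-in for your own countBases() from the previous exercise
    counts = {}
    for base in seq:
        counts[base] = counts.get(base, 0) + 1
    return counts

def printBaseComposition(seq):
    """Print one line per base with its proportion of seq."""
    counts = countBases(seq)
    total = len(seq)
    for base in counts:
        # Proportion = occurrences of this base / length of the sequence
        print(base + ": " + str(counts[base] / float(total)))

printBaseComposition("ACTG")
```

Note that the function prints the table itself and returns nothing, so there is no need to wrap the call in print().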

Now let's work on more data

Now download this file:

hivsequences.txt

Once you have downloaded the file, you can open it in Sublime Text (yes, it works for all kinds of text files) and see what it looks like. Each line in the sequence file has a sequence name followed by a space followed by a sequence, like this, but with longer sequences of course:

HIV1.A.1 ATGGGTGCGAGAGCGTCAATATTAAGCGGGGGAAGATTAG...
HIV1.A.2 TGGAAGGGCTAATTTACTCCAAGAAAAGACAAGACATCCT...
HIV1.A.3 TGGATGGGTTAATTTACTCCAAGAAAAGGCAAGAAATCCT...
HIV1.A.4 TTGAAAAGCGAAAGTAACAGGGACTCGAAAGCGAAAGTTC...
HIV1.B.1 TGGAAGGGCTAATTCACTCCCAACGAAGACAAGATATCCT...
HIV1.B.2 GAGCCTGGGAGCTCTCTGGCTAGCTGGGGAACCCACTGCT...
HIV1.B.3 GGACCTGAAAGCGAAAGAGAAACCAGAGGAGCTCTCTCGA...
HIV1.B.4 GCGTCAGTATTAAGCGGGGGAAAATTAGATACATGGGAGA...
HIV1.C.1 GACTTGAAAGCGAAAGTAAGACCAGAGGAGATCTCTCGAC...
HIV1.C.2 AAATCTCTAGCAGTGGCGCCCGAACAGGGGACCTGAAAGC...
HIV1.C.3 AAATCTCTAGCAGTGGCGCCCGAACAGGGACCTGAAAGCG...
HIV1.C.4 TCTGTTGTGTGACTCTGGTAACTAGAGATCCCTCAGACCT...
HIV1.D.1 GGTCTCTCTGGTTAGACCAGATTTGAGCCTGGGAGCTCTC...
HIV1.D.2 GCGAGAGCGTCAATATTAAGCGGGGGAAAATTGGATGCAT...
HIV1.D.3 GCGAGAGCGTCAGTATTAAGCGGGGGACAATTAGATGCAT...
HIV1.D.4 CTGAAAGCGAAAGTAGAACCAGAGGAGATCTCTCGACGCA...
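
To get a feel for this format, the small sketch below takes a single line apart with the string method split(). The line used here is a shortened, made-up stand-in for a real line in the file, including the newline character that ends each line.

```python
# Shortened, illustrative line; real lines hold much longer sequences
line = "HIV1.A.1 ATGGGTGCGAGAGC\n"

# split() without arguments splits on whitespace and drops the newline
name, seq = line.split()

print(name)
print(seq)
```

Because every line holds exactly two pieces separated by a space, the result of split() can be assigned to two variables at once.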

Understand and explain in detail what the code below does and how. Hint: copy it into your editor so you can print the values of different variables. The open() function is described in the Files chapter of the book. Consider this a primer for next week, where we will talk more about files.

hivFile = open("hivsequences.txt", 'r')
statistics = {}
for l in hivFile:
    name, seq = l.split()
    if name not in statistics:
        statistics[name] = {}
    for b in seq:
        if b not in statistics[name]:
            statistics[name][b] = 0
        statistics[name][b] += 1
hivFile.close()

for name in statistics:
    print(name)
    total = sum(statistics[name].values())
    for b in statistics[name]:
        print("\t", b, statistics[name][b] / float(total))

Solutions to exercise

Not available yet
