Munch Lab

Exercise: Codon usage bias


This exercise is a continuation of the previous exercise about open reading frames. Codon usage bias refers to differences in the frequency of occurrence of synonymous codons in coding DNA. A codon is a series of three nucleotides (triplets) that encodes a specific amino acid residue in a polypeptide chain or for the termination of translation (stop codons). There are 64 different codons (61 codons encoding for amino acids plus 3 stop codons) but only 20 different translated amino acids. The overabundance in the number of codons allows many amino acids to be encoded by more than one codon. Because of such redundancy it is said that the genetic code is degenerate. Different organisms often show particular preferences for one of the several codons that encode the same amino acid, that is, a greater frequency of one will be found than expected by chance. How such preferences arise is a much debated area of molecular evolution.

Outlining the problem

Given an open reading frame like the ones you found in the previous exercise, you must compute the codon usage. Your goal is to create a dictionary with keys corresponding to amino acids (i.e. single letter strings designating an amino acid such as “R” for arginine).

The value associated with each amino acid key should also be a dictionary, and the keys of this dictionary should be the different codons that encode the amino acid and the value associated with each key should be the relative frequency with which that codon is used to encode that amino acid.

You should only include information about the different amino acids that are used in your ORF data. Also, you should handle both uppercase and lowercase input sequences. Once you are done it should look somewhat like this:

{'A': {'GCA': 0.0, 'GCC': 0.0, 'GCT': 1.0, 'GCG': 0.0}, 
'C': {'TGC': 0.0, 'TGT': 1.0}, 'E': {'GAG': 0.33333333333333331, 
'GAA': 0.66666666666666663}, 'D': {'GAT': 1.0, 'GAC': 0.0}, 
'G': {'GGT': 0.33333333333333331, 'GGG': 0.0, 
'GGA': 0.66666666666666663, 'GGC': 0.0}, 'F': {'TTC': 0.0, 
'TTT': 1.0}, 'I': {'ATT': 1.0, 'ATC': 0.0, 'ATA': 0.0}, 
'H': {'CAC': 0.0, 'CAT': 1.0}, 'K': {'AAG': 0.20000000000000001,
 'AAA': 0.80000000000000004}, '*': {'TAG': 0.0, 'TGA': 1.0, 
'TAA': 0.0}, 'M': {'ATG': 1.0}, 'L': {'CTT': 0.0, 
'CTG': 0.66666666666666663, 'CTA': 0.0, 'CTC':0.0, 'TTA':
 0.33333333333333331, 'TTG': 0.0}, 'N': {'AAT': 0.5, 
'AAC': 0.5}, 'Q': {'CAA': 0.59999999999999998, 
'CAG': 0.40000000000000002}, 'P': {'CCT': 0.5, 'CCG': 0.0,
 'CCA': 0.5, 'CCC': 0.0}, 'S': {'TCT': 0.0, 'AGC': 0.0, 
'TCG': 0.0, 'AGT': 0.5, 'TCC': 0.0, 'TCA': 0.5}, 
'R': {'CGA': 0.33333333333333331, 'CGC': 0.0, 
'AGA': 0.33333333333333331, 'AGG': 0.0, 'CGG': 0.0,
'CGT': 0.33333333333333331}, 'T': {'ACC': 0.0, 'ACA': 0.0, 
'ACG': 0.0, 'ACT': 1.0}, 'W': {'TGG': 1.0}, 'V': 
{'GTA': 0.66666666666666663, 'GTC': 0.0, 'GTT': 0.16666666666666666, 
'GTG': 0.16666666666666666}, 'Y': {'TAT': 1.0, 'TAC': 0.0}}

This does not look to pretty. So our ultimate goal is to print so it is easily read – like this:

    GCA: 0%
    GCC: 0%
    GCT: 100%
    GCG: 0%
    TGC: 0%
    TGT: 100%
    GAG: 33%

Before you read on think about how to break up this task into smaller functions that are more easily implemented.

In this exercise we go through one of the many ways to do this. We split the solution to the problem into five functions, two of which you have already implemented.

  • split_codons(orf) that produces a list of codons in an ORF.
  • count_codons(orf) that produces a dictionary of the codons used in the orf, mapping to the number of times each codon has been used.
  • group_counts_by_aa(counts) that takes the dictionary of all codon counts that is produced by count_codons(orf) and then groups the counts by the amino acid the codons encode. The result is a dictionary of dictionaries like the one shown above, but with counts of codons – not the frequencies as in the example above.
  • normalize_counts(d) that takes the dictionary produced by codon_counts_per_aa(orf) and replaces the counts for each amino acid with frequencies.
  • compute_codon_bias(orf) that uses the functions above to make the normalised dictionary of dictionaries that we need.
  • print_codon_bias(orf) that prints the dictionary returned by findCodonBias(orf) in a nice way.


You will need to download unless you did so in the previous exercise. If you put the file where you have your other code, then you can import variables with the codon_map and the start and stop codons into your script by putting this at the top of your file where you write our own code:

from codon_translation import codon_map

You also need to down load some open reading frames to test your code on. You can download a file
sample_orfs.txt. If you put the file where you have your other code then you can use this bit of code to read the ORFs into a list:

f = open('sample_orfs.txt', 'r')
orf_list = list()
for line in f:
seq = line.strip()

Try to print the list to see it. Then pick out the first ORF in the list so you can use that to test your code:

test_orf = orf_list[0]

Split the ORF into codons

The first one you have already implemented in the previous exercise – and if not here it is in my version to get you started.

def split_codons(orf):
codon_list = []
for i in range(0, len(orf)-2, 3):
return codon_list

Count codons in an ORF

Write a function count_codons(orf) that counts the the occurrences of codons in the ORF. You need to use the split_codons function to split the ORF into a list of codons. Then you can create an empty dictionary that you can populate with counts. You want all the possible codons to be in your dictionary, so that the codons you do not find in your ORF will a count of 0. That means that you must start by filling the dictionary with a key for each codon and give each a count of 0. You can do that by iterating over the keys in codon_map. Then you must iterate over all the codons in the list of codons produced by the split_codons function, and add counts to the dictionary as you go along. We treated a very similar example of this accumulator pattern in a lecture.

Remember to “uppercase” you off so that count_codons("atgATGTAA") returns:

{'ATG': 2, 'TAA': 1}

and not:

{'ATG': 1, 'tag': 1, 'TAA': 1}

Group codon counts by amino acid

Write a function group_counts_by_aa(codon_counts) that takes as argument the dictionary returned by count_codons(orf). It must return a dictionary that pairs each of the 20 amino acids with a dictionary with counts of how many times each codon is used to encode that amino acid. Once you have made this data structure (and say you call it grouped_counts) you can access each count like this:


and since you can already compute a dictionary with codon counts like this

codon_counts = count_codons(orf)

the task is really to distribute those counts into groups for each amino acid like this (but not literally like this):

grouped_counts[acid][codon] = codon_counts[codon]

Make sure you understand this before you go on.

Your function must start by defining an empty dictionary to add to. Use count_codons(orf) to get a dictionary of all codon counts. Then use the .items() dictionary method to produce all the (key, value) pairs in codon_map so you can loop over all the (key, value) pairs like this:

def group_counts_by_aa(counts):
counts_by_aa = {}
for codon, acid in codon_map.items():
# from here you are on your own...

Your dictionary should not include amino acids that are not in the ORF, but you want all the possible codons for each amino acid represented in your nested dictionaries. Your result would not correctly represent codon bias if the codons that where not used was not included. We include the codons that we did not see in the ORF by looping over the codon_map you imported earlier.

Normalize counts

You need to normalize the counts so they become frequencies (i.e. count/total). Write a function normalize_counts(codon_counts_by_aa) that takes a dictionary of dictionaries (as returned by group_counts_by_aa(off) ) and produces a new dictionary of dictionaries that has frequencies in stead of counts. The frequencies for codons that encode the same amino acid must sum to one.  If the input dictionary is called codon_counts_by_aa the dictionary with counts for the ‘V’ amino acid was (codon_counts_by_aa['V']) was:

{'GTA': 4, 'GTC': 0, 'GTT': 1, 'GTG': 1}

in the input dictionary, then it should be:

{'GTA': 0.6666666666666666, 'GTC': 0.0, 'GTT': 0.16666666666666666, 'GTG': 0.16666666666666666}

in the output dictionary.

To do that you should first produce a dictionary with amino acids as keys and empty dictionaries as values. Then you can loop over the and keys of codon_counts_by_aa and its inner dictionaries holding the counts:

for aa in codon_counts_by_aa:
for codon in codon_counts_by_aa[aa]:

You need to compute the total counts for each amino acid too so you can produce each frequency like this:

codon_freq_by_aa[aa][codon] = codon_counts_by_aa[aa][codon] / float(total)

Compute the codon bias

Write a function compute_codon_bias(off) that uses count_codons(off), group_counts_by_aa(orf) and normalize_counts(d) to make the normalised dictionary of dictionaries that was showed in the beginning of the exercise. It should look something like this:

def find_codon_bias(orf):
codon_counts = count_codons(orf)
stats_per_aa = group_counts_by_aa(codon_counts)
return stats_per_aa

Print it nicely

Write a function print_codon_bias(orf) that prints the dictionary returned by compute_codon_bias(orf) in a nice way.  This is done using two nested for loops, the first one iterating over amino acids and the second iterating over the codons for each amino acid.

    GCA: 0%
    GCC: 0%
    GCT: 100%
    GCG: 0%
    TGC: 0%
    TGT: 100%
    GAG: 33%

Solution to exercise

You can download the solution form the last table on the course front page.

Applied programming 2014 week six


At both lectures we will talk about how to break up a large problem into a set of smaller and problems that are more easily solved.

Computer exercises

The computer exercise for this week was once an exam set. So if you think it is difficult at this point, do not despair. It is about finding open reading frames in bacteria sequences. There is a link to the exercise in the outline table on the course page.

Reading material

None for this week. Spend the extra time going through each separate topic we have been through to make sure you understand each part well enough. Otherwise it will become too confusing when we start to put all the things together.

Weekly assignment

For the assignment you must write four functions. Most of them can be implemented with list comprehensions if you have the guts.

Write a function square(L) that takes a list of numbers and returns the list of those
numbers squared.

Example usage:

print square([2, 5, 3, 6])

should print:
[4, 25, 9, 36]

Write a function even(L) that takes a list of numbers and returns a list of only those
numbers that are even. Remember that you can test if a number is even using n % 2 == 0.

Example usage:

print even([2, 5, 3, 6])

should print:
[2, 6]

Write a function, differences(xs,ys) that takes two lists of equal length, xs and ys, and
returns a list of the pairwise differences. One way to solve it is similar to the way you
did the pairwiseDifferences function in last weeks exercise. An other way is to use the
built in function zip(xs,ys) to get a list of all the matching pairs of xs and ys to
iterate through (look at the documentation yourself).

Example usage:

print differences([2, 5, 3, 6], [1, 5, 6, 3]) 

should print:
[1, 0, -3, 3]

Write a function squaredDifferences(xs, ys) that takes two lists of equal length, xs and
ys, and returns a list of the pairwise differences squared. You can use the
differences(xs,ys) function to make it easier to write this function.

Example usage:

print squaredDifferences([2, 5, 3, 6], [1, 5, 6, 3]) 

should print:
[1, 0, 9, 9]

Exercise: Genome assembly

This is a tricky exercise. Make sure you read the entire exercise and understand what you are supposed to do – before you start coding.

In genome assembly many short sequences (reads) from a sequencing machine are assembled into long sequences – ultimately chromosomes. This is done by ordering overlapping reads so they together represent genomic sequence. For example given these three reads: AGGTCGTAG, CGTAGAGCTGGGAG, GGGAGGTTGAAA ordering them based on their overlap like this


produces the following genomic sequence:


So very briefly, the task is to read in the sequence reads, identify overlaps between them, find the right order of reads and then reconstruct the genomic sequence from the overlapping reads.

Real genome assembly is of course more sophisticated than what we do here, but the idea is the same. Here we make two important simplifying assumptions.

  1. There are no sequencing errors implying that overlaps between reads are perfect matches.
  2. No reads are nested in other reads.

The second assumption imply that overlaps are always of this type:


and never this type:


In this exercise you will be asked to write functions that together solve the problem of assembling a genomic sequence. Some of the functions each solves a small problem, and you will use these inside other function you write, to put together solutions to larger subproblems.

The sequence reads for the mini assembly are in, genome_assembly.txt, that you need to download. The format of this file is like this:

1 ATGCG...
2 AGGCG...
3 TGAAG...

Each line represents a read. The first field on each line is the name of the read and the second field is the read sequence itself. So for the first line '1' is the name and 'ATGCG...' is the sequence.

Problem 1

The first task is to read and parse the input data. Write a function

def readDataFromFile(fileName):

that takes a string fileName as argument. To read in genome_assembly.txt you need to put it in the same directory (folder) as the file with python code you are writing. The function must return a dictionary where keys are read names and values are the associated read sequences. Both keys and values must be strings.

Example usage:


should return a dictionary with the following content (maybe not with key-value pairs in that order)


Problem 2

Often there are too many reads to look at them manually. We want to know what the mean length of the reads is.

Write a function

def meanLength(fileName):

that takes a string fileName containing the name of an input file like the one above. The function must return a float beting the mean sequence length.

Problem 3

Next thing is to figure out which reads overlap each other. To do that we need a function that takes two read sequences and computes their overlap. In the input data none of the reads are completely nested in another read. So given two reads that overlap, the 3′ end of the left read overlaps the 5′ end of the right read.

We know that there are no sequencing errors, so in the overlap the sequence match will be perfect. You need to loop over all possible overlaps honoring that one sequence is the left one and the other is the right one. In the for loop, start with the largest possible overlap ( min(len(left), len(right))) and evaluate smaller and smaller overlaps until you find an exact match.

Hint: In each iteration of the loop you want to to compare the last i bases of the left sequence with the first i bases of the right sequence. You can slice out the latter using right[:i].

Write a function:

def getOverlap(left, right):

that takes two string arguments, left and right, each containing a read sequence. The function must return the overlapping sequence. If there is no overlap it should return an empty string. So with these two reads:


getOverlap(s1, s2)

should return


and getOverlap(s2, s1)

should return


So in that case it seems that s1 and s2 overlap and that s1 is the left one and s2 is the right one. Treating s2 as the left one and s1 as the right one only gives an overlap of one base (we expect a few bases of overlap even for unrelates sequences).

Problem 4

Now that we can evaluate the overlap between two reads in some orientation we can compute overlaps between all pairs of reads in both left-right and right-left orientations.

Write a function:

def getAllOverlaps(reads):

that takes a dictionary argument reads as produced by readDataFromFile. The function must return a dictionary of dictionaries containing the number of overlapping bases for a pair of reads in a specific orientation. Overlap of a read to itself is meaningless and must not be included. If the overlap between read 2 in left position and 5 in right position is 21 then d['2']['5'] must be 21 (if the dictionary was called d). Make sure you understand how this data structure represents the overlaps before you go on.

Hint: d['2'] will be a dictionary where keys (d['2'].keys()) are the names of reads that have an overlap with read ‘2’ when ‘2’ is put in the left position, and values (d[‘2’].values()) are the number of overlapping bases for those reads.

Example usage:


should return a dictionary containing (maybe not with key-value pairs in that order):

{'1': {'3': 0, '2': 1, '5': 1, '4': 0, '6': 29},
 '3': {'1': 0, '2': 0, '5': 0, '4': 1, '6': 1},
 '2': {'1': 13, '3': 1, '5': 21, '4': 0, '6': 0},
 '5': {'1': 39, '3': 0, '2': 1, '4': 0, '6': 14},
 '4': {'1': 1, '3': 1, '2': 17, '5': 2, '6': 0},
 '6': {'1': 0, '3': 43, '2': 0, '5': 0, '4': 1}}

Hint: To generate all combinations of reads you need two for-loops. One looping over reads in left positions and another (inside the first one) looping over reads in right position.

Problem 5

The dictionary returned by getAllOverlaps is a little messy to look at. We want to print it in a nice matrix-like format so we can better see which pairs overlap in what orientations.

Write a function

def prettyPrint(overlaps):

that takes a dictionary argument overlaps containing a dictionary of dictionaries produced by getAllOverlaps. The function should not return anything but must print a matrix exactly as shown in the example below with nicely aligned and right-justified columns. First column must hold names of reads in left conformation. The top row holds names of reads in right conformation. Remaining cells each hold the number of overlapping bases for a left-right read pair. The diagonal corresponds to overlaps to the read itself. You must put dashes in these cells.

Example usage:


should print exactly

   1  2  3  4  5  6 
1  -  1  0  0  1 29 
2 13  -  1  0 21  0
3  0  0  -  1  0  1
4  1 17  1  -  2  0 
5 39  1  0  0  - 14 
6  0  0 43  1  0  -

Hint: To print something padded to fill three characters use string formatting like this:

print "% 3d" % 13,

The comma and the end suppresses the trailing newline that is added by default.

Notice that the overlaps are not either zero or a large number. A lot of the overlaps are 1 or 2. This is because you often find a few bases of random overlap between any two reads. So in the following we must distinguish true (significant) overlaps from random ones. We decide that true overlaps are the ones with an overlap larger than two (>2).

Problem 6

Now that we know how the reads overlap we can chain them together pair by pair from left to right to get the order in which they represent the genomic sequence. To do this we take the first (left-most) read and identify which read has the largest overlap to its right end. Then we take that read and find the read with the largest overlap to the right end of that – and so on until we reach the rightmost (last) read.

The first thing you need to do is to identify the first (leftmost) read so we know where to start. This read is identified as the one that only has a significant (>2) overlap to its right end (it only has a good overlap when positioned to the left of other reads). In the example output from prettyPrint above the first read would be read '4' because the '4' column has no significant overlaps (no one larger than two).

Write a function

def findFirstRead(overlaps):

that takes a dictionary argument overlaps returned from getAllOverlaps. It must return a string containing the name of the first read.

Example usage:


should return


Problem 7

Now that we have the first read we can recursively find the correct ordering of reads. We want a list with the read names in the right order. To see how this can be solved using recursion, consider that the task can be broken into small identical tasks of finding the read that has the largest overlap to the right end of the current read (that is the recursive case). You just keep doing this until you reach the last (right-most) read that does not have any significant (>2) overlap to its right end (that is the base case).

In the recursive case you need to identify which read that is next. I.e. which read has the largest overlap to the right end of the current read. Here we use our dictionary of overlaps. If the first read is ‘4’ then overlaps[‘4’] is a dictionary of reads with overlap to the right end of read ‘4’. So to find the name of the read with the largest overlap you must write a function that finds the key associated with the largest value in a dictionary. We do that first:

Write a function

def findKeyForLargestValue(d):

that takes a dictionary argument d and returns the key associated with the largest value. Use that function as a tool in the next function.

Write a function

def findOrder(name, overlaps):

that takes a string argument name containing the name of the first read identified in problem 6, and a dictionary argument overlaps returned from getAllOverlaps. The function must return a list of read names in the order in which they represent the genomic sequence. Hint: You know the first read is given by the name argument, you also know that you can find the next read in the chain of overlapping reads by using the findKeyForLargestValue function, and you know that you should keep adding on reads to the chain as long as the overlap is larger than two (you can use a while loop).

Example usage:
findOrder('4', overlaps)

should return:

['4', '2', '5', '1', '6', '3']

Make sure you understand why this is the right list of read names before you try to implement the function.

Problem 8

Now that you have the number of overlapping bases between reads and the correct order of the reads you can reconstruct the genomic sequence.

Write a function

def assembleGenome(readOrder, reads, overlaps):

that takes a list argument readOrder containing the order of reads, a dictionary argument reads returned from readDataFromFile and a dictionary argument overlaps returned from getAllOverlaps. The function must return a string with the genomic sequence:


Hint: iterate over the reads in order and use the overlap information to extract and join the appropriate parts of the reads.

Putting the whole thing together

Once you have all the functions you can use them in the following way to produce the genome:

fileName = 'genome_assembly_input.txt'
reads = readDataFromFile(fileName)

print meanLength(fileName)

overlaps = getAllOverlaps(reads)


name = findFirstRead(overlaps)
order = findOrder(name, overlaps)

genome = assembleGenome(order, reads, overlaps)
print genome


None yet

Exercise: Finding open reading frames


In this exercise we will be looking at DNA from Streptococcus bacteria. We will work with the detection of open reading frames in this DNA and with the translation from DNA to amino acid sequences. The purpose of these exercises is to get familiar with strings, list and dictionaries and to learn and use some common methods on these data types. Another pupose is to give you an example of breaking down a larger problem into smaller more manageable sub-problems.

Read Wikipedia: Genetic Code and Wikipedia: Open Reading Frame if you do not already know of DNA to amino acid translation and open reading frames.

The exercise has three parts. In the first exercise we work on how to translate an open reading frame into a protein sequence. In the second exercise we work on how to identify such open reading frames in a long genomic sequence. In the last part you will put the two first parts together so you can find candidate protein sequences from in a genome.

Start by reading through the exercise before you do anything else. That way you have a good overview of the tasks ahead. As you will see there is quite a bit of hints to help you on your way. Here is a mind map of how we split of the larger problem into smaller bits and how they fit together:


This is the real thing. It is not a toy problem, which also means it has to be a bit more elaborate. You are supposed to think this is hard – but don’t despair – this is as hard as it gets in this course. Come prepared for the computer exercises by completing as much as you can at home. That way you will get the most of the help Dan can provide –  so you complete the last bits later on our own.

Part 1: Translating Open Reading Frames

In this exercise we write code to translate an open reading frame on a DNA sequence into single letters representing the corresponding sequence of amino acids.

Translating a single codon

Start by downloading this file: Then open it in your editor and look at it. It contains a dictionary that maps codons to letters that represent amino acids. It also contains a string variable, start_codon, and a list variable, stop_codons. You can import these variables into your script like this:

from codon_translation import codon_map,
start_codon, stop_codons

Write a function, translateCodon(codon) that takes a string codon, and if codon is a valid codon it returns codon translated into the corresponding amino acid using the codon_map dictionary in the codon_translation module. If amino acid codon is not a valid codon, the function should return '?'.

Example usage:

print translateCodon('ACG')


Notice that the keys in codon_map dictionary are in upper case, so you must make sure that codon is in upper case for the look up. You can translate codon into an upper case version of itself using the upper() method:

codon = codon.upper()

Before you start coding you should outline for yourself intuitively what you need to do to implement the desired functionality. In this case want to translate, or map, between a three letter sequence, codon, and the corresponding one letter symbol for the amino acid that the codon corresponds to. Your key is the string, codon, with the three bases and the value you retrieve using that key is the letter symbol for the amino acid. Try that out first. E.g.:

codon = "TTG"
aminoAcid = codon_map[codon]
print aminoAcid

What you then need to do is to create a function that takes a string as an argument and returns the corresponding amino acid symbol after having looked it up in the dictionary. Make a function that does just that before you go on.

Before you are completely done you need to make your function handle the situation when the argument to the function is not a key in the codon_map dictionary. Use an if-else construct to handle the two cases. The boolean expression needs to test if codon is a key in codon_map. Remember the has_key() method? – or the even cooler in operator on dictionaries?

Splitting an ORF into codons

To translate an entire open reading frame into the corresponding amino acid sequence, we need to split the DNA sequence into codons. Then we can translate each codon using the function translateCodon(codon) we just wrote.

Write a function, splitCodons(orf), that takes a string orf and splits it into triplets returning a list of these triplets.

Example usage:

print splitCodons('atcgagcttadcgta')

['atc', 'gag', 'ctt', 'adc', 'gta']

Divide the exercise into simpler tasks like above. You need to loop over the sequence to perform operations on it. Start by writing a function that prints each character:

def splitCodons(orf):
for i in range(len(orf)):
print orf[i]

Try now to figure out how you can modify the function to make it move over the sequence in jumps of three. Look at the documentation for the range function (Google it!) to see how you can make range return a list of numbers with increments of three like this: 0, 3, 6, 9, 12, … . Modify your function so that it now prints every third character.

What you want is obviously not every third character but the three characters that follow every third character. You can use index i in the for loop to get the corresponding codon using slicing like this:

codon = orf[i:i+3]

Modify your function so that it prints each codon.

Now all that remains is to put each codon on a list that we can return from the function. You can define a list codonList before your for loop so you have a list to add codons to:

codonList = []

Each codon you slice out of the orf you add to the end of codonList using the append method:


Modify your function so that it creates and returns a list of all the codons and you are done.

Translating an ORF

Now write a function, translateSequence(orf), that takes a DNA sequence as input and translates it into the corresponding amino acid sequence.

Example usage:

print translateSequence('atcgagcttadcgta')


The function should first split the ORF into codons. Use your splitCodons(orf) function to take care of that. So start by writing a function that just calls and prints the result of splitCodons(orf):

def translateSequence(orf):
codonList = splitCodons(orf)
return codonList

Make that work first.

Now modify the function by using the translateCodon(codon) function to translate each codon into the corresponding amino acid independently. Instead of returning the codonList create a for loop that iterates over the codons in codonList so you can translate each codon using translateCodon(codon) and print the amino acid letter.

Now modify the function so each amino acid letter is appended to a list letterList in the same way you appended codons to codonList in the previous exercise. Make the function print the list so you can see it looks okay.

Now all you need to do is to turn the list of letters, letterList, into a string that you can return from the function. Use the join() method on an empty string to create a string from the list of amino acids.

aminoAcids = ''.join(letterList)

Part 2: Finding Open Reading Frames

In this exercise we write code that finds and translates open reading frames in a DNA sequence.

Read in sequences

Download the following file that contain sample sequences that we want to analyse to find open reading frames:


You can use the code below to read the sequences into a list. Make sure you understand how it works before you write it into your script. Do not copy/paste it.

f = open('sample_sequences.txt', 'r')
sequences = list()
for line in f:
seq = line.strip()

Finding the start positions of ORFs in a DNA sequence

Write a function, findStartPositions(seq), that takes a DNA sequence seq as input and returns a list of the positions where the start codons occur. You already imported start_codon and stop_codons from the module codon_translation.

Example usage:

print findStartPositions(sequences[0])

[21 39, 47, 94, 197, 395, 450, 458, 539, 568]

Your function should contain a for loop that iterates over all possible positions in the seq where a codon can begin. Not surprisingly these are all the positions except for the last two. So start out with this:

def findStartPositions(seq):
for i in range(len(seq) - 2):
print i

Now, instead of just printing i try and make it print the three bases following i using the slicing technique you also used in the splitCodons(orf) function. E.g. if you sequence is 'TGACTG' it should first print 'TGA' then 'GAC' then 'ACT' and so on.

When you have this working you need to replace the printing of codons with an if construct that tests if each triplet is equal to start_codon. Try and make your function only print i when i is the start of an ORF.

Finally, modify the function so all the i values are collected in a list using the same technique as in the splitCodons(orf) function and then return this list from the function.

Finding the next specific codon in an ORF

Now that we can find where the ORFs begin in our sequence we must also be able to identify where each of these end. As you know an ORF ends at any of three different stop codons in the same reading frame as the start codon. So we need to be able to find the next occurence of some specific codon starting from the beginning of the reading frame.

Write a function, findNextCodon(seq, start, codon), that returns the index for beginning of the next occurrence of the codon, codon, in seq when starting at index start and staying in the reading frame of start, i.e. you should look at codons in jumps of three. If the function does not find codon in the string it should return None.

Example usage:

print findNextCodon('AAAAATTTAATTTAA', 1, 'TTT')


Your function should contain a for loop that iterates over all the relevant starts of codons. Remember that no valid codon can start at the last two positions in the sequence. E.g. if start is 7 and len(seq) is 20 then the relevant indexes are 7, 10, 13, 16. Figure out how to make the range function return the appropriate list of numbers.

Start by writing a function just with this for loop that prints the numbers you get from the correct use of range.

def findNextCodon(seq, start, codon):
for i in range( ?? ):
print i

When you have that working use the slicing technique to print the codons starting at each position instead.

Finally, add an if statement that tests if each codon is equal to codon and if so makes the function return i.

Finding the next stop codon in an ORF

We need to write a function, findNextStopCodon(seq, start) that returns the index for the beginning of the next stop codon in seq starting in position start and staying in start’s reading frame. If there is no stop codon in start’s reading frame the function should return None. You can find the stop codons in stop_codons (and you probably want to reuse the function you just wrote).

Example usage:

print findNextStopCodon('AAATAGATGAAAA', 0)


This is a tricky function so we will walk through it together:

def findNextStopCodon(seq, start):
seq = seq.upper()
results = []
for stop_codon in stop_codons:
pos = findNextCodon(seq, start, stop_codon)
if pos != None:
if len(results) > 0:
return min(results)
return None

First we upper case the seq. Then we define a list to hold the indexes for our candidate stop codons. Then we loop over the possible stop codons to find the next of each in the reading frame. If findNextCodon(seq, start, stop_codon) does not find anything it returns None. We only want to keep results that are not None which is why we test if pos != None. At the end we test if we have any indexes in our results list. If we do find indexes for stop codons we return the smallest of them which is the one closet to the start If we do not find any stop codons the function returns None by default.

Finding ORFs

Write a function, findOpenReadingFrames(seq). It should return a list of all open reading frames in the input sequence seq, where an open reading frame is represented by a tuple (start, stop). start is the position of the start codon and stop is the position of the stop codon. The function should handle both uppercase and lowercase inputs.

Example usage:

print findOpenReadinFrames(sequences[0])

[(21, 48), (39, 48), (47, 89), (94, 154), (197, 503),
(395, 503), (450, 456), (458, 503), (539, 569), (568, 634)]

Write a function that first uses findStartPositions(seq) to get a list of all the start positions in seq:

def findOpenReadingFrames(seq):
startPositions = findStartPositions(seq)

When you have that working add a for loop that iterates over the start positions so you can find the position of the next stop codon for each start position. Use findNextStopCodon(seq, start) for that. Try and print the start and end indexes you find.

Finally, you need to add the (start, end) pairs to a list so you can return them. To append a pair to a list, you need to write:


notice the double parenthesis – you give a tuple as argument to the append method.

Test your function. Chances are that some of the end positions you get are None. This is because there was no stop codon for some of the start codons found. Add an if statement to your function to control that only pairs with valid stop codons are added to the list of results.

Finding AND translating ORFs

Write a function findAndTranslateORFs(seq). Given a DNA sequence seq it should return a list of the translations of all open reading frames in the sequence. It should handle both uppercase and lowercase inputs.

Example usage:

print findAndTranslateORFs(sequences[0])


The function should call findOpenReadingFrames(seq) to get the list of (start, end) pairs and then use a for loop to call translateSequence(seq[start:stop+3] on each pair:

def findAndTranslateOpenReadingFrames(seq):
for start, stop in findOpenReadingFrames(seq):
print translateSequence(seq[start:stop+3])

All that remains then is to add each amino acid sequence to a list that the function can

Hint: To check your result note that all returned sequences should start with a start
codon 'M', end with a stop codon '*' and contain no stop codons in the middle.


Code can be downloaded from the last table on the course front page.

Exercise: Comparing HIV sequences


In this exercise you will work with a HIV sequence. It is a HIV-1 sequence but we also want to know that type it is. To help us type our sequence we have access to an alignment of sequences for which the type is known. By comparing our HIV sequence to the sequences of each type we can identify the most similar sequence which then is assumed to be of the same type as our HIV sequence.

Comparing two sequences

First we need to write some code that compares two sequences so we can compare our HIV sequence to each of the HIV sequences in the database. Write a function pairWiseDifferences(sequence1, sequence2) that computes the proportion of bases that differ between two DNA sequences of the same lengthsequence1 and sequence2 are strings. Example usage:

d = pairWiseDifferences('AGTC' 'AGTT')

Remember that range(len(sequence1)) generates a list of the indexes you can use to index the string. So you need a for loop in your function and a variable that keeps track of how many differences you see have seen as you iterate through the sequences. This is the variable the function returns.

Read in sequence data

Now download the following files:

The file unknown.txt contain a HIV sequence of unknown type. The other four files contain database HIV sequences of type A, B, C and D. Put them in the folder where you keep your other python code. On most computers you can right-click on the link and have it “save files as…”. Have a look through at what is in the module.

Here is the code you need to read the sequences from the typeA.txt file into a list of sequences called typeA_list.

f = open('typeA.txt', 'r')
typeA_list = list()
for line in f:
seq = line.strip()

You need to do this five times (once for each file) to produce a list of sequences of each known type and a list of sequences of unknown type. Remember to get the modify the code above accordingly, so you produce lists with the following names: typeA_list, typeB_list, typeC_list, typeD_list and unknown_list (the last one will only have one sequence in it).

Comparing aligned sequences

All sequences, including the unknown sequence, are from the same multiple alignment. This means that sequence positions match up across all sequences but also that a lot of gap characters, '-' are inserted.

That means that you need to modify pairWiseDifferences(seq1, seq2) so it does not consider sequence positions where both bases are '-' characters. In other words, you not only need to count differences. You also need to count how many alignment columns that are not "-" for both sequences. E.g. the following mini alignment has five such columns:


Hint: use an if-statement in pairWiseDifferences with a boolean expression like this:

not (seq1[i] == '-' and seq2[i] == '-')

Once your function has computed both the number of differences and the number of relevant positions you can have it return the fraction of different bases as nrDifferences/float(nrRelevantPositions).

Compare your HIV sequence to the HIV sequences with known type

Use the function pairWiseDifferences(seq1, seq2)  to find the database sequence that is most similar to the sequence unknown_list[0] to determine its type.

To type you HIV sequence you need compare your sequence to all the database sequences to see which group has the best matching sequence.

Hint: For each list, e.g. typeA_list, you can go through the aligned sequences with a for loop to calculate the similarity to each sequence.

You can define a list that you can add your results for each type to like this:

similaritiesTypeA = []

and then append results to the list as you calculate them in the for loop. Look up the append() method. Once you are done you need to find the smallest result in the list (fewest differences). Try the min() function.

Solutions to exercise

Code can be downloaded from the last table on the course front page.

Exercise: Base composition of HIV sequences

In this exercise we want to compute the frequency of the four nucleotides in the gene.

Counting bases

Write a function, countBases(seq), that, given a DNA string seq returns a dictionary that maps each base to the number of occurrences of that base in seq.

Example usage:

print countBases('CTGGCCCT')

{'A': 1, 'C': 4, 'T': 2, 'G': 2}

Then try it out on your HIV sequence and see what you get.

Computing frequencies

Write a function, baseComposition(seq), that given a DNA string seq returns a dictionary with the frequency of each base. Call countBases(seq) from within baseComposition(seq) to count the bases.

Example usage:

frequencies = baseComposition('ACTG')
print frequencies

Should print
{'G': 0.25, 'T': 0.25, 'C': 0.25, 'A': 0.25}

Printing frequencies

Write a function, printFrequencies(seq), that given a dictionary with frequencies (like that returned by baseComposition) prints the frequency of each base like this:

A: 0.25
C: 0.25
T: 0.25
G: 0.25

Now for something more complicated…

The purpose of this last exercise is to see if you can understand a more complicated code example. First download this file:


Once you have downloaded the file you can open it in Sublime Text (yes with works for all kinds of text files) and see what it looks like. Each line in the sequence file has a sequence name followed by a space followed by a sequence.

Now understand and explain in detail what the code below does and how. Write it (no copy/paste) into your editor so you can add print the values of different variables.

hivFile = open('hivsequences.txt', 'r')
statistics = {}
for l in hivFile:
name, seq = l.split()
if name not in statistics:
statistics[name] = {}
for b in seq:
if b not in statistics[name]:
statistics[name][b] = 0
statistics[name][b] += 1

for name in statistics:
print name
total = sum(statistics[name].values())
for b in statistics[name]:
print '\t', b, statistics[name][b] / float(total)

Solutions to exercise

Code can be downloaded from the last table on the course front page.