Munch Lab

Exercise: Codon usage bias


This exercise is a continuation of the previous exercise about open reading frames. Codon usage bias refers to differences in the frequency of occurrence of synonymous codons in coding DNA. A codon is a series of three nucleotides (a triplet) that encodes a specific amino acid residue in a polypeptide chain or signals the termination of translation (stop codons). There are 64 different codons (61 codons encoding amino acids plus 3 stop codons) but only 20 different translated amino acids. This surplus of codons means that many amino acids are encoded by more than one codon. Because of this redundancy the genetic code is said to be degenerate. Different organisms often show particular preferences for one of the several codons that encode the same amino acid, that is, one codon is found more frequently than expected by chance. How such preferences arise is a much debated area of molecular evolution.
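To make the redundancy concrete, the sketch below builds the standard genetic code and counts how many codons encode each amino acid. The table is written out here only for illustration, using the conventional compact layout of the standard code with the bases in the order T, C, A, G. In the exercise you will use the imported codon_map instead, and the names standard_code and codons_per_aa are made up for this example.

```python
from itertools import product

# Standard genetic code in the conventional compact layout:
# codons are enumerated with bases in the order T, C, A, G.
bases = "TCAG"
amino_acids = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")

# Map each of the 4 * 4 * 4 = 64 codons to its amino acid ('*' is stop).
standard_code = {}
for (first, second, third), acid in zip(product(bases, repeat=3), amino_acids):
    standard_code[first + second + third] = acid

# Count how many synonymous codons each amino acid has.
codons_per_aa = {}
for codon, acid in standard_code.items():
    codons_per_aa[acid] = codons_per_aa.get(acid, 0) + 1

print(len(standard_code))    # 64 codons in total
print(codons_per_aa['L'])    # 6 codons encode leucine
print(codons_per_aa['M'])    # 1 codon encodes methionine
```

Amino acids with only one codon, such as methionine and tryptophan, cannot show any codon bias, which is why their frequency is always 1.0 in the output of this exercise.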

Outlining the problem

Given an open reading frame like the ones you found in the previous exercise, you must compute the codon usage. Your goal is to create a dictionary with keys corresponding to amino acids (i.e. single letter strings designating an amino acid such as “R” for arginine).

The value associated with each amino acid key should also be a dictionary, and the keys of this dictionary should be the different codons that encode the amino acid and the value associated with each key should be the relative frequency with which that codon is used to encode that amino acid.

You should only include information about the different amino acids that are used in your ORF data. Also, you should handle both uppercase and lowercase input sequences. Once you are done it should look somewhat like this:

{'A': {'GCA': 0.0, 'GCC': 0.0, 'GCT': 1.0, 'GCG': 0.0}, 
'C': {'TGC': 0.0, 'TGT': 1.0}, 'E': {'GAG': 0.33333333333333331, 
'GAA': 0.66666666666666663}, 'D': {'GAT': 1.0, 'GAC': 0.0}, 
'G': {'GGT': 0.33333333333333331, 'GGG': 0.0, 
'GGA': 0.66666666666666663, 'GGC': 0.0}, 'F': {'TTC': 0.0, 
'TTT': 1.0}, 'I': {'ATT': 1.0, 'ATC': 0.0, 'ATA': 0.0}, 
'H': {'CAC': 0.0, 'CAT': 1.0}, 'K': {'AAG': 0.20000000000000001,
 'AAA': 0.80000000000000004}, '*': {'TAG': 0.0, 'TGA': 1.0, 
'TAA': 0.0}, 'M': {'ATG': 1.0}, 'L': {'CTT': 0.0, 
'CTG': 0.66666666666666663, 'CTA': 0.0, 'CTC':0.0, 'TTA':
 0.33333333333333331, 'TTG': 0.0}, 'N': {'AAT': 0.5, 
'AAC': 0.5}, 'Q': {'CAA': 0.59999999999999998, 
'CAG': 0.40000000000000002}, 'P': {'CCT': 0.5, 'CCG': 0.0,
 'CCA': 0.5, 'CCC': 0.0}, 'S': {'TCT': 0.0, 'AGC': 0.0, 
'TCG': 0.0, 'AGT': 0.5, 'TCC': 0.0, 'TCA': 0.5}, 
'R': {'CGA': 0.33333333333333331, 'CGC': 0.0, 
'AGA': 0.33333333333333331, 'AGG': 0.0, 'CGG': 0.0,
'CGT': 0.33333333333333331}, 'T': {'ACC': 0.0, 'ACA': 0.0, 
'ACG': 0.0, 'ACT': 1.0}, 'W': {'TGG': 1.0}, 'V': 
{'GTA': 0.66666666666666663, 'GTC': 0.0, 'GTT': 0.16666666666666666, 
'GTG': 0.16666666666666666}, 'Y': {'TAT': 1.0, 'TAC': 0.0}}

This does not look too pretty, so our ultimate goal is to print it so that it is easily read, like this:

    GCA: 0%
    GCC: 0%
    GCT: 100%
    GCG: 0%
    TGC: 0%
    TGT: 100%
    GAG: 33%

Before you read on, think about how to break up this task into smaller functions that are more easily implemented.

In this exercise we go through one of the many ways to do this. We split the solution to the problem into six functions, two of which you have already implemented.

  • split_codons(orf) that produces a list of codons in an ORF.
  • count_codons(orf) that produces a dictionary of the codons used in the orf, mapping to the number of times each codon has been used.
  • group_counts_by_aa(counts) that takes the dictionary of all codon counts that is produced by count_codons(orf) and then groups the counts by the amino acid the codons encode. The result is a dictionary of dictionaries like the one shown above, but with counts of codons – not the frequencies as in the example above.
  • normalize_counts(d) that takes the dictionary produced by group_counts_by_aa(counts) and replaces the counts for each amino acid with frequencies.
  • compute_codon_bias(orf) that uses the functions above to make the normalised dictionary of dictionaries that we need.
  • print_codon_bias(orf) that prints the dictionary returned by compute_codon_bias(orf) in a nice way.


You will need to download the file containing the codon_translation module unless you did so in the previous exercise. If you put the file where you have your other code, then you can import the variables with the codon_map and the start and stop codons into your script by putting this at the top of the file where you write your own code:

from codon_translation import codon_map

You also need to download some open reading frames to test your code on. You can download a file, sample_orfs.txt. If you put the file where you have your other code, then you can use this bit of code to read the ORFs into a list:

f = open('sample_orfs.txt', 'r')
orf_list = list()
for line in f:
    # remove the trailing newline and add the ORF to the list
    seq = line.strip()
    orf_list.append(seq)
f.close()
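A slightly more robust way to read the file is to wrap the loop in a function and use a with statement, which closes the file even if an error occurs. This is only a sketch: the function name read_orfs is hypothetical, and it assumes that sample_orfs.txt holds one ORF per line.

```python
def read_orfs(filename):
    """Return a list with one ORF string per non-empty line in the file."""
    orf_list = []
    with open(filename, 'r') as f:   # the file is closed when the block ends
        for line in f:
            seq = line.strip()       # remove the newline and surrounding spaces
            if seq:                  # skip blank lines
                orf_list.append(seq)
    return orf_list
```

With the file in place, orf_list = read_orfs('sample_orfs.txt') gives you the list of ORFs.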

Try to print the list to see it. Then pick out the first ORF in the list so you can use that to test your code:

test_orf = orf_list[0]

Split the ORF into codons

You have already implemented the first function in the previous exercise. If not, here is my version to get you started.

def split_codons(orf):
    codon_list = []
    for i in range(0, len(orf)-2, 3):
        codon_list.append(orf[i:i+3])
    return codon_list
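To check that the function behaves as expected, try it on a short sequence. The sketch below repeats the function with the slicing step written out. Note that range stops at len(orf)-2, so one or two leftover bases at the end do not produce an incomplete codon. The test sequences are made up.

```python
def split_codons(orf):
    codon_list = []
    # step through the sequence three bases at a time
    for i in range(0, len(orf)-2, 3):
        codon_list.append(orf[i:i+3])
    return codon_list

print(split_codons("ATGAAATAA"))    # ['ATG', 'AAA', 'TAA']
print(split_codons("ATGAAATAAGC"))  # ['ATG', 'AAA', 'TAA'], the trailing 'GC' is dropped
```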

Count codons in an ORF

Write a function count_codons(orf) that counts the occurrences of codons in the ORF. You need to use the split_codons function to split the ORF into a list of codons. Then you can create an empty dictionary that you can populate with counts. You want all the possible codons to be in your dictionary, so that the codons you do not find in your ORF will have a count of 0. That means that you must start by filling the dictionary with a key for each codon and give each a count of 0. You can do that by iterating over the keys in codon_map. Then you must iterate over all the codons in the list of codons produced by the split_codons function, and add counts to the dictionary as you go along. We treated a very similar example of this accumulator pattern in a lecture.

Remember to uppercase your ORF so that count_codons("atgATGTAA") returns (leaving out the codons with a count of 0):

{'ATG': 2, 'TAA': 1}

and not:

{'ATG': 1, 'atg': 1, 'TAA': 1}
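If you get stuck, the accumulator pattern described above can be sketched as follows. To keep the example self-contained it uses a tiny stand-in codon_map with only five codons and its own copy of split_codons. In your script you would use the imported codon_map, which is assumed here to map each codon to its one-letter amino acid.

```python
# Stand-in for the imported codon_map (assumption: codon -> amino acid letter).
codon_map = {'ATG': 'M', 'AAA': 'K', 'AAG': 'K', 'TAA': '*', 'TGA': '*'}

def split_codons(orf):
    codon_list = []
    for i in range(0, len(orf)-2, 3):
        codon_list.append(orf[i:i+3])
    return codon_list

def count_codons(orf):
    # start with a count of 0 for every known codon
    counts = {}
    for codon in codon_map:
        counts[codon] = 0
    # uppercase the ORF first so 'atg' and 'ATG' are counted together
    for codon in split_codons(orf.upper()):
        counts[codon] += 1
    return counts

print(count_codons("atgATGTAA"))
```

In this sketch a codon that is not a key in codon_map (for example one containing an N) raises a KeyError, which is one simple way to notice unexpected input.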

Group codon counts by amino acid

Write a function group_counts_by_aa(codon_counts) that takes as argument the dictionary returned by count_codons(orf). It must return a dictionary that pairs each amino acid used in the ORF with a dictionary with counts of how many times each codon is used to encode that amino acid. Once you have made this data structure (and say you call it grouped_counts) you can access each count like this:

grouped_counts[acid][codon]

and since you can already compute a dictionary with codon counts like this

codon_counts = count_codons(orf)

the task is really to distribute those counts into groups for each amino acid like this (but not literally like this):

grouped_counts[acid][codon] = codon_counts[codon]

Make sure you understand this before you go on.
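One reason the assignment cannot be written literally like this is that grouped_counts[acid] must already hold a dictionary before you can assign into it. A minimal sketch with made-up values:

```python
grouped_counts = {}
acid, codon, count = 'V', 'GTA', 4   # made-up example values

try:
    grouped_counts[acid][codon] = count   # fails: there is no inner dictionary yet
except KeyError:
    print("no dictionary for", acid, "yet")

# create the inner dictionary the first time an amino acid is seen
if acid not in grouped_counts:
    grouped_counts[acid] = {}
grouped_counts[acid][codon] = count

print(grouped_counts)   # {'V': {'GTA': 4}}
```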

Your function must start by defining an empty dictionary to add to. The argument counts is the dictionary of all codon counts that you get from count_codons(orf). Then use the .items() dictionary method to produce all the (key, value) pairs in codon_map so you can loop over them like this:

def group_counts_by_aa(counts):
    counts_by_aa = {}
    for codon, acid in codon_map.items():
        # from here you are on your own...

Your dictionary should not include amino acids that are not in the ORF, but you want all the possible codons for each amino acid represented in your nested dictionaries. Your result would not correctly represent codon bias if the codons that were not used were not included. We include the codons that we did not see in the ORF by looping over the codon_map you imported earlier.
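If you get stuck, here is one possible way to complete the function. Treat it as a sketch, not as the reference solution. It uses a small stand-in codon_map (assumed to map codons to one-letter amino acids, like the imported one) and made-up counts, and it leaves out amino acids whose codons all have a count of 0.

```python
# Stand-in for the imported codon_map (assumption: codon -> amino acid letter).
codon_map = {'ATG': 'M', 'AAA': 'K', 'AAG': 'K',
             'TGG': 'W', 'TAA': '*', 'TAG': '*', 'TGA': '*'}

def group_counts_by_aa(counts):
    counts_by_aa = {}
    for codon, acid in codon_map.items():
        # make sure the amino acid has an inner dictionary
        if acid not in counts_by_aa:
            counts_by_aa[acid] = {}
        # codons missing from counts are recorded with a count of 0
        counts_by_aa[acid][codon] = counts.get(codon, 0)
    # keep only the amino acids that are actually used in the ORF
    used = {}
    for acid, codon_counts in counts_by_aa.items():
        if sum(codon_counts.values()) > 0:
            used[acid] = codon_counts
    return used

# made-up counts: tryptophan (W) is never used
counts = {'ATG': 1, 'AAA': 3, 'AAG': 0, 'TGG': 0, 'TAA': 1, 'TAG': 0, 'TGA': 0}
print(group_counts_by_aa(counts))
```

Here 'W' is dropped because its only codon was never seen, while 'AAG' is kept with a count of 0 because lysine is used.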

Normalize counts

You need to normalize the counts so they become frequencies (i.e. count/total). Write a function normalize_counts(codon_counts_by_aa) that takes a dictionary of dictionaries (as returned by group_counts_by_aa) and produces a new dictionary of dictionaries that has frequencies instead of counts. The frequencies for codons that encode the same amino acid must sum to one. If the input dictionary is called codon_counts_by_aa and the dictionary with counts for the 'V' amino acid (codon_counts_by_aa['V']) was:

{'GTA': 4, 'GTC': 0, 'GTT': 1, 'GTG': 1}

in the input dictionary, then it should be:

{'GTA': 0.6666666666666666, 'GTC': 0.0, 'GTT': 0.16666666666666666, 'GTG': 0.16666666666666666}

in the output dictionary.

To do that you should first produce a dictionary with amino acids as keys and empty dictionaries as values. Then you can loop over the keys of codon_counts_by_aa and of its inner dictionaries holding the counts:

for aa in codon_counts_by_aa:
    for codon in codon_counts_by_aa[aa]:

You need to compute the total counts for each amino acid too so you can produce each frequency like this:

codon_freq_by_aa[aa][codon] = codon_counts_by_aa[aa][codon] / float(total)
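Putting these pieces together, normalize_counts could look like the sketch below. It is one possible version, tested on the example counts for 'V' from above. It assumes that every amino acid in the input has a total count above 0, which holds if unused amino acids were left out when grouping. Otherwise the division would fail.

```python
def normalize_counts(codon_counts_by_aa):
    codon_freq_by_aa = {}
    for aa in codon_counts_by_aa:
        # total number of codons observed for this amino acid
        total = sum(codon_counts_by_aa[aa].values())
        codon_freq_by_aa[aa] = {}
        for codon in codon_counts_by_aa[aa]:
            codon_freq_by_aa[aa][codon] = codon_counts_by_aa[aa][codon] / float(total)
    return codon_freq_by_aa

freqs = normalize_counts({'V': {'GTA': 4, 'GTC': 0, 'GTT': 1, 'GTG': 1}})
print(freqs['V']['GTA'])   # 0.6666666666666666
```

The function builds a new dictionary instead of changing the input, so the counts are still available afterwards.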

Compute the codon bias

Write a function compute_codon_bias(orf) that uses count_codons(orf), group_counts_by_aa(codon_counts) and normalize_counts(d) to make the normalised dictionary of dictionaries that was shown in the beginning of the exercise. It should look something like this:

def compute_codon_bias(orf):
    codon_counts = count_codons(orf)
    counts_per_aa = group_counts_by_aa(codon_counts)
    # turn the grouped counts into frequencies
    stats_per_aa = normalize_counts(counts_per_aa)
    return stats_per_aa

Print it nicely

Write a function print_codon_bias(orf) that prints the dictionary returned by compute_codon_bias(orf) in a nice way. This is done using two nested for loops, the first one iterating over amino acids and the second iterating over the codons for each amino acid. The output should look like this:

    GCA: 0%
    GCC: 0%
    GCT: 100%
    GCG: 0%
    TGC: 0%
    TGT: 100%
    GAG: 33%
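The two nested loops and the percent formatting can be sketched as follows, assuming the nested dictionary has already been computed. The helper name format_codon_bias and the example dictionary are made up for illustration. Your print_codon_bias(orf) would compute the dictionary from the ORF and print the lines instead of returning them.

```python
def format_codon_bias(bias):
    """Return one 'CODON: NN%' line per codon, indented by four spaces."""
    lines = []
    for aa in bias:                      # outer loop: amino acids
        for codon in bias[aa]:           # inner loop: codons for that amino acid
            percent = bias[aa][codon] * 100
            # {:.0f} rounds to a whole number of percent
            lines.append("    {}: {:.0f}%".format(codon, percent))
    return lines

example = {'A': {'GCA': 0.0, 'GCT': 1.0},
           'E': {'GAG': 1/3.0, 'GAA': 2/3.0}}
for line in format_codon_bias(example):
    print(line)
```

Returning the lines as a list makes the formatting easy to test separately from the printing.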

Solution to exercise

You can download the solution from the last table on the course front page.
