Munch Lab

Genome assembly

In genome assembly many short sequences (sequencing reads) from a sequencing machine is assembled into long sequences – ultimately chromosomes. This is done by ordering overlapping reads so they together represent genomic sequence. For example: ordering these three reads: AGGTCGTAG, CGTAGAGCTGGGAG, GGGAGGTTGAAA based on their overlap like this:

AGGTCGTAG
    CGTAGAGCTGGGAG
             GGGAGGTTGAAA

produces the genomic sequence they together represent:

AGGTCGTAGAGCTGGGAGGTTGAAA

So very briefly the task is to read in the sequence reads, identify overlaps between them, find the right order of reads and then reconstruct the genomic sequence from the overlapping reads.

Real genome assembly is of course more sophisticated than what we do here, but the idea is the same. We make two important simplifying assumptions.

  1. There are no sequencing errors implying that overlaps between reads are perfect matches.
  2. No reads are nested in other reads. I.e. always overlaps of

this type:

XXXXXXXXXX
    XXXXXXXXXXX

never this type:

XXXXXXXXXXXXXX
   XXXXXXXX

In this exercise you will write functions that together solve the problem of assembling genomic sequence. Some of the functions you make solve subtasks that that are used by other functions.

For this exercise there is a file with input data that you need to download. The format of this file is like this:

1 ATGCG...
2 AGGCG...
3 TGAAG...

Each line represents a read. The first field on each line is the name of the read and the second field is the read sequence itself. So for the first line ‘1’ is the name and ‘ATGCG…’ is the sequence.

Read sequencing reads from a file

The first task is to read and parse the input data. Write a function

def readDataFromFile(fileName):

that takes a string fileName as argument. To read in genome_assembly.txt you need to put it in the same directory (folder) as the file with python code you are writing. The function must return a dictionary where keys are read names and values are the associated read sequences. Both keys and values must be strings.

Example usage: readDataFromFile('genome_assembly.txt') should return a dictionary with the following content (maybe not with key-value pairs in that order)

{'1': 'GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTCGTCCAGACCCCTAGC',
 '3': 'GTCTTCAGTAGAAAATTGTTTTTTTCTTCCAAGAGGTCGGAGTCGTGAACACATCAGT',
 '2': 'CTTTACCCGGAAGAGCGGGACGCTGCCCTGCGCGATTCCAGGCTCCCCACGGG',
 '5': 'CGATTCCAGGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC',
 '4': 'TGCGAGGGAAGTGAAGTATTTGACCCTTTACCCGGAAGAGCG',
 '6': 'TGACAGTAGATCTCGTCCAGACCCCTAGCTGGTACGTCTTCAGTAGAAAATTGTTTTTTTCTTCCAAGAGGTCGGAGT'}

Hint: Make an empty dictionary to populate with the date from the file. Open the file and read it line by line using a for loop. You can use the split() method of the strings that hold each line to split it into the fields in each line. Once you have them you can use the right keys and values to update the dictionary.

As always – make sure you know exactly what you want to do before you start coding. Keep it working by testing that your code does what you think it does by – preferably for every line of code added. You can temporarily add print statements ensure this.

Compute the mean length of your reads

Often there are too many reads to look at them manually. We want to know what the mean length of the reads is.

Write a function

def meanLength(fileName):

that takes a string fileName containing the name of an input file like the one above. The function must return a float beting the mean sequence length.

To compute the mean (or average) you need all the lengths of all the reads so you can compute their sum and divide by the number of reads. The function you are asked to make takes a file name as argument so it should make use of your readDataFromFile() function. That is, you call the function readDataFromFile() inside the meanLength() function to get the data in the form of a dictionary.

def meanLength(fileName):
    reads = readDataFromFile(fileName)

Now use the accumulator pattern by using a for loop to iterate over all the values in the in the reads dictionary. You can use the values() method of the dictionary object.

total = 0
for r in reads.values():
    # here you must find the length of r and add it to the total

Now all that remains is to divide the total my the number of read and return that value.

Compute overlaps between two sequencing reads

Next thing is to figure out which reads overlap each other. To do that we need a function that takes two sequencing reads and computes their overlap. In the input data none of the reads are completely nested in another read. So given two reads that overlap, the 3′ end of the left read overlaps the 5′ end of the right read.

We know that there are no sequencing errors, so in the overlap the sequence match will be perfect. You need to loop over all possible overlaps honoring that one sequence is the left one and the other is the right one. In the for loop, start by evaluating with the largest possible overlaps to the left read (which is of cause len(left)) and evaluate smaller and smaller overlaps until you find an exact match between the left read and an other read.

Hint: In each iteration of the loop you want to to compare the last len(left)-i bases of the left sequence with the first len(left)-i bases of the right sequence. You can slice out the latter using right[:len(left)-i] and test if they are the same:

left[i:] == right[:len(left)-i]

The function we need to write takes two string arguments, left and right, each representing a sequencing read:

def getOverlap(left, right):

The function should return the longest perfectly matching sequence between the left and right read. If there is no overlap it should return an empty string. So with these two reads:

s1 = "CGATTCCAGGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC"
s2 = "GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTCGTCCAGACCCCTAGC"

getOverlap(s1, s2) should return 'GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC' and
getOverlap(s2, s1) should return 'C'

In that case it seems that s1 and s2 may fit together and that s1 is the left one and s2 is the right one. Treating s2 as the left one and s1 as the right one only gives an overlap of only one base (we expect a few bases of overlap even for unrelates sequences).

Get the overlap between all pairs of reads

Now that we can evaluate the overlap between two reads in some orientation we can compute overlaps between all pairs of reads in both left-right and right-left orientations.

To do that we write a new function:

def getAllOverlaps(reads):

that takes a dictionary argument reads as produced by readDataFromFile. The function must return a dictionary of dictionaries that contain the number of overlapping bases for a pair of reads where the first in the pair is the left read and the second is the right read. Overlap of a read to itself is meaningless and should not be included. If the overlap between read 2 in left position and 5 in right position is 21 then d['2']['5'] must be 21 (if the dictionary was called d).

Hint: d['2'] will be a dictionary where keys (d['2'].keys()) are the names of reads that have an overlap with read ‘2’ when ‘2’ is put in the left position, and values (d[‘2’].values()) are the number of overlapping bases for those reads.

Make sure you understand how this data structure represents the overlaps before you go on.

Example usage: getAllOverlaps(reads) should return a dictionary like this (maybe not with key-value pairs in that order):

{'1': {'3': 0, '2': 1, '5': 1, '4': 0, '6': 29},
 '3': {'1': 0, '2': 0, '5': 0, '4': 1, '6': 1},
 '2': {'1': 13, '3': 1, '5': 21, '4': 0, '6': 0},
 '5': {'1': 39, '3': 0, '2': 1, '4': 0, '6': 14},
 '4': {'1': 1, '3': 1, '2': 17, '5': 2, '6': 0},
 '6': {'1': 0, '3': 43, '2': 0, '5': 0, '4': 1}}

Hint: To generate all combinations of reads you need two for-loops. One looping over reads in left positions and another (inside the first one) looping over reads in right position.

Print a pretty table of the overlaps between reads

The dictionary returned by getAllOverlaps is a little messy to look at. We want to print it in a nice matrix-like format so we can better see which pairs overlap in what orientations.

So we write a function that does that:

def prettyPrint(overlaps):

that takes a dictionary argument overlaps containing a dictionary of dictionaries produced by getAllOverlaps. The function should not return anything but must print a matrix exactly as shown in the example below with nicely aligned and right-justified columns. First column must hold names of reads in left conformation. The top row holds names of reads in right conformation. Remaining cells each hold the number of overlapping bases for a left-right read pair. The diagonal corresponds to overlaps to the read itself. You must put dashes in these cells.

Example usage: prettyPrint(overlaps) should print exactly

   1  2  3  4  5  6 
1  -  1  0  0  1 29 
2 13  -  1  0 21  0
3  0  0  -  1  0  1
4  1 17  1  -  2  0 
5 39  1  0  0  - 14 
6  0  0 43  1  0  -

Hint: To create a string padded with spaces to fill three characters use string formatting like this:

"{:>3}".format(4)

print() adds a newline at then end when you print someting. To keep it from doing that, so you can print several numbers after each other, you can give print the the extra keyword argument end='' – telling it to put nothing (an empty string) at the end. (Remember to put from __future__ import print_function at the first line of your script – otherwise this will not work.)

Notice that the overlaps are not either zero or a large number. A lot of the overlaps are 1 or 2. This is because you often find a few bases of random overlap between any two reads. So in the following we must distinguish true (significant) overlaps from random ones. For this exercise we decide that true overlaps are the ones with an overlap larger than two (>2).

Find the read that correspond to the left end of the assembled sequence

Now that we know how the reads overlap we can chain them together pair by pair from left to right to get the order in which they represent the genomic sequence. To do this we take the first (left-most) read and identify which read to the right that has the largest overlap. Then we take read identified and find the read with the largest overlap to the right end of that – and so on until we reach the rightmost (last) read.

The first thing you need to do is to identify the first (leftmost) read so we know where to start. This read is identified as the one that only has no significant (>2) overlap to its left end – it only has a good overlap when positioned to the left of other reads. In the example output from prettyPrint above we can see that the first read would be read ‘4’ because the ‘4’ column has no significant overlaps (no one larger than two).

Now write the function

def findFirstRead(overlaps):

that takes a dictionary argument overlaps returned from getAllOverlaps. It must return a string containing the name of the first read.

Example usage: findFirstRead(reads) should return 4

Find the next read

Now we have the first read but to produce the complete ordering of reads we also need to be able to find the next. To find the next read we use our dictionary of overlaps returned by the getAllOverlaps function. If the first read is ‘4’ then overlaps[‘4’] is a dictionary of reads with overlap to the right end of read ‘4’. So to find the name of the next read you must find the key in that dictionary with the largest overlap value. I.e. if you your left read is ‘4’ then the relevant dictionary is overlaps['4'], which is {'1': 1, '3': 1, '2': 17, '5': 2, '6': 0}, and you can read the next read as the key with the largest value. In this case it is ‘2’.

Now write the function

def findNextRead(d):

that takes a dictionary argument d and returns the key associated with the largest value as shown above.

Find the correct ordering of sequencing reads

Now that we have functions for finding the first read and for finding the next read we can use them to produce the list of reads in the right order. So we need a function:

def findOrder(name, overlaps):

The function should take a string argument name containing the name of the first read, and a dictionary argument overlaps returned from getAllOverlaps. The function must return a list of read names in the order in which they represent the genomic sequence.

Example usage:
findOrder(‘4’, reads): should return: ['4', '2', '5', '1', '6', '3']

Make sure you understand why this is the right list of read names before you try to implement the function.

This can be solved in two ways: using recursion and using a while loop. Try both.

To see how this can be solved using recursion, consider that the task can be broken into small identical tasks of finding the read that has the largest overlap to the right end of the current read. You just keep doing this until you reach the last (right-most) read that does not have any significant (>2) overlap to its right end (that is the base case). Here is what that version looks like:

def findOrder(name, d):
    nextName = findNextRead(d[name])
    if d[name][nextName] <= 2: # no significant overlap to the right
        # this is the base case, we have reached the rightmost read
        return [name]
    else:
        # this is the recursive case
        return [name] + findOrder(nextName, d)

 

In the version using a while loop you start by making a list to hold your order of reads. When you begin, it should only have the first read in it: order = [name]. Then find the next read like in the first line in the recursive version. Then write a while loop that tests if d[name][next] > 2. As long as this is true we have not reached the rightmost read and we should append next to the order list. Also inside the while loop you should assign new values to name and nextName: read, next = next, findKeyForLargestValue(d[next]). That moves us forward one read so the while loop will eventually reach the rightmost read. Once the while loop completes all that remains is to return order. Here is some of it to help you along.

def findOrder(name, d):
    order = [name]
    nextName = findNextRead(d[name])
    while d[name][nextName] > 2:
        # from here you are on your own...

Assemble the sequence

Now that you have both the overlaps between reads and the correct order of the reads you can reconstruct the genomic sequence.

You need to write a function

def assembleGenome(readOrder, reads, overlaps):

that takes a list argument readOrder containing the order of reads, a dictionary argument reads returned from readDataFromFile and a dictionary argument overlaps returned from getAllOverlaps. Use the functions you have written to produce and return a string with the genomic sequence that the reads represent.

Hint: iterate over the reads in readOrder and use the overlap information to extract and the appropriate part of each read (using slicing) before add it to the genome sequence.

This should be your final genome sequence:

TGCGAGGGAAGTGAAGTATTTGACCCTTTACCCGGAAGAGCGGGACGCTGCCCTGCGCGATTC
CAGGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTCGTCCAGACCCCTAGCTGGTACG
TCTTCAGTAGAAAATTGTTTTTTTCTTCCAAGAGGTCGGAGTCGTGAACACATCAGT

Solution

Download linked file

One thought on “Genome assembly”

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s

%d bloggers like this: