Munch Lab

Exercise: Comparing HIV sequences


In this exercise you will work with a HIV sequence. It is a HIV-1 sequence but we also want to know that type it is. To help us type our sequence we have access to an alignment of sequences for which the type is known. By comparing our HIV sequence to the sequences of each type we can identify the most similar sequence which then is assumed to be of the same type as our HIV sequence.

Comparing two sequences

First we need to write some code that compares two sequences so we can compare our HIV sequence to each of the HIV sequences in the database. Write a function pairWiseDifferences(sequence1, sequence2) that computes the proportion of bases that differ between two DNA sequences of the same lengthsequence1 and sequence2 are strings. Example usage:

d = pairWiseDifferences('AGTC' 'AGTT')

Remember that range(len(sequence1)) generates a list of the indexes you can use to index the string. So you need a for loop in your function and a variable that keeps track of how many differences you see have seen as you iterate through the sequences. This is the variable the function returns.

Read in sequence data

Now download the following files:

The file unknown.txt contain a HIV sequence of unknown type. The other four files contain database HIV sequences of type A, B, C and D. Put them in the folder where you keep your other python code. On most computers you can right-click on the link and have it “save files as…”. Have a look through at what is in the module.

Here is the code you need to read the sequences from the typeA.txt file into a list of sequences called typeA_list.

f = open('typeA.txt', 'r')
typeA_list = list()
for line in f:
seq = line.strip()

You need to do this five times (once for each file) to produce a list of sequences of each known type and a list of sequences of unknown type. Remember to get the modify the code above accordingly, so you produce lists with the following names: typeA_list, typeB_list, typeC_list, typeD_list and unknown_list (the last one will only have one sequence in it).

Comparing aligned sequences

All sequences, including the unknown sequence, are from the same multiple alignment. This means that sequence positions match up across all sequences but also that a lot of gap characters, '-' are inserted.

That means that you need to modify pairWiseDifferences(seq1, seq2) so it does not consider sequence positions where both bases are '-' characters. In other words, you not only need to count differences. You also need to count how many alignment columns that are not "-" for both sequences. E.g. the following mini alignment has five such columns:


Hint: use an if-statement in pairWiseDifferences with a boolean expression like this:

not (seq1[i] == '-' and seq2[i] == '-')

Once your function has computed both the number of differences and the number of relevant positions you can have it return the fraction of different bases as nrDifferences/float(nrRelevantPositions).

Compare your HIV sequence to the HIV sequences with known type

Use the function pairWiseDifferences(seq1, seq2)  to find the database sequence that is most similar to the sequence unknown_list[0] to determine its type.

To type you HIV sequence you need compare your sequence to all the database sequences to see which group has the best matching sequence.

Hint: For each list, e.g. typeA_list, you can go through the aligned sequences with a for loop to calculate the similarity to each sequence.

You can define a list that you can add your results for each type to like this:

similaritiesTypeA = []

and then append results to the list as you calculate them in the for loop. Look up the append() method. Once you are done you need to find the smallest result in the list (fewest differences). Try the min() function.

Solutions to exercise

Code can be downloaded from the last table on the course front page.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: