Munch Lab

Comparing HIV sequences


In this exercise you will work with a HIV sequence. It is a HIV-1 sequence but we also want to know that type it is. To help us type our sequence we have access to an alignment of sequences for which the type is known. By comparing our HIV sequence to the sequences of each type we can identify the most similar sequence which then is assumed to be of the same type as our HIV sequence.

Comparing two sequences

First we need to write some code that compares two sequences so we can compare our HIV sequence to each of the HIV sequences in the database. Write a function pairWiseDifferences(sequence1, sequence2) that computes the proportion of bases that differ between two DNA sequences of the same lengthsequence1 and sequence2 are strings. Example usage:

d = pairWiseDifferences("AGTC" "AGTT")

Remember that range(len(sequence1)) generates a list of the indexes you can use to index the string. So you need a for loop in your function and a variable that keeps track of how many differences you see have seen as you iterate through the sequences. This is the variable the function returns.

Comparing aligned sequences

Now download the module (if you did not do that already) and put it in the folder where you keep your other python code. On most computers you can right-click on the link and have it “save files as…”. Have a look through at what is in the module. Import it like this:

import exerciseWeek3

This module gives you access to exerciseWeek3.alignedHivSeq which is the sequence we want to type. The lists exerciseWeek3.typeAexerciseWeek3.typeBexerciseWeek3.typeC, and exerciseWeek3.typeD contain database HIV sequences of different known types. All sequences are from the same multiple alignment. This means that sequence positions match up across all sequences but also that a lot of gap characters, ‘-‘ are inserted.

That means that you need to modify pairWiseDifferences(seq1, seq2) so it does not consider sequence positions where both bases are ‘-‘ characters. In other words, you not only need to count differences. You also need to count how many alignment columns that are not “-” for both sequences. E.g. the following mini alignment has five such columns:


Hint: use an if-statement in pairWiseDifferences with a boolean expression like this:

not (seq1[i] == '-' and seq2[i] == '-')

Once your function has computed both the number of differences and the number of relevant positions you can have it return the fraction of different bases as nrDifferences/float(nrRelevantPositions).

Compare your HIV sequence to the HIV sequences with known type

Use the function pairWiseDifferences(seq1, seq2)  to find the database sequence that is most similar to exerciseWeek3.alignedHivSeq to determine its type.

To type you HIV sequence you need compare your sequence to all the database sequences to see which group has the best matching sequence.

Hint: For each list, e.g. exerciseWeek3.typeA, you can go through the aligned sequences with a for loop to calculate the similarity to each sequence.

You can define a list that you can add your results for each type to like this:

similaritiesTypeA = []

and then append results to the list as you calculate them in the for loop. Look up the append() method. Once you are done you need to find the smallest result in the list (fewest differences). Try the min() function.

Solutions to execise

Not available yet

One thought on “Comparing HIV sequences”

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: