Munch Lab

Applied programming 2015 week five

Reading

There is no reading for this week. Spend the extra time looking back on the chapters, slides, and exercises we have covered so far.

Lectures

At the Thursday lecture I will talk more about dictionaries. At the Tuesday lecture we will talk about how to put together all the things we have learned so far, and a bit about modules.

Exercises  (TØ)

The exercise for this week is about analyzing the base composition of HIV sequences. It is available as a link on the main course page in the table showing the course outline.

Mandatory assignment

Write a function, parse_fasta(filename), that reads the file named filename. This file should contain multiple sequence entries in Fasta format, and the function must parse this data and return a list of tuples of the form (header, sequence). Download this file and use it as input. The first entry in the file is:

>numberOne this is sequence one
AGTTTCCCTCAAATCACTCTTTGGCAACGACCCATCGTCACAGTAAGAAT
AGAGGGACAGCTAAGAGAAGCTCTATTAGATACAGGAGCAGACGATACAG
TATTAGAAGACATAGATTTGCCAGGAAAATGGAAACCAAGAATGATAGGG
GGAATTGGAGGCTTCATCAAGGTAAAACAGTATGATCAGATATCTATAGA
AATTTGTGGAAAAAGAGCTATAGGTACAGTATTAGTAGGACCTACACCTG
TCAACATAATTGGAAGAAACATGATGACGCAGATTGGCTGTACTTTAAAT
TTGGCAATTAGTCCTATTGAGACTGTACCAGTAAAATTAAAGCCAGGAAT
GGATGGGCCAAAGGTTAAACAATGGCCACTGACAGCAGAAAAAATAAAAG
ATTGGGCCTGAAAATCCA

so this tuple should be the first one in your list:

("numberOne this is sequence one", "AGTTTCCCTCAAATCACTCTTTGGCAACGACCCATCGTCACAGTAAGAATAGAGGGACAGCTAAGAGAAGCTCTATTAGATACAGGAGCAGACGATACAGTATTAGAAGACATAGATTTGCCAGGAAAATGGAAACCAAGAATGATAGGGGGAATTGGAGGCTTCATCAAGGTAAAACAGTATGATCAGATATCTATAGAAATTTGTGGAAAAAGAGCTATAGGTACAGTATTAGTAGGACCTACACCTGTCAACATAATTGGAAGAAACATGATGACGCAGATTGGCTGTACTTTAAATTTGGCAATTAGTCCTATTGAGACTGTACCAGTAAAATTAAAGCCAGGAATGGATGGGCCAAAGGTTAAACAATGGCCACTGACAGCAGAAAAAATAAAAGATTGGGCCTGAAAATCCA")
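A list of such tuples can be unpacked directly in a for loop. The short sketch below uses made-up, shortened entries (the names toyOne and toyTwo and their sequences are not from the course file) just to show the shape of the value your function should return.

```python
# Made-up, shortened entries just to illustrate the list-of-tuples format
entries = [
    ("toyOne a first toy header", "AGTTTCCC"),
    ("toyTwo a second toy header", "GGATGG"),
]

# Each tuple unpacks into its header and its sequence
for header, sequence in entries:
    print(header + " has length " + str(len(sequence)))

# A single tuple can be unpacked the same way
first_header, first_sequence = entries[0]
```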

This assignment can be solved in many different ways of varying complexity, so be inventive, and after you have handed in your own assignment, have a look at what your friends have done to solve the problem.

Here is one approach: read the entire content of the file into a string using the read() method of the opened file. Then use the split() method to split the string into a list of individual Fasta entries. Try using '>' as the argument to split() and print the resulting list. Now you can use a for loop to iterate over the elements of this list (each a string representing one Fasta entry) and produce a tuple for each, holding the header and the sequence. Hint: use the splitlines() method to split each string (Fasta entry) into the individual lines it contains (i.e. the header line and all the sequence lines). Then you can fish out the header line from that list. To produce the sequence you need to join all the sequence lines. Here is some code to get you started:

fasta_file = open('input.fasta', 'r')
file_content = fasta_file.read()
fasta_file.close()
list_of_entries = file_content.split(">")
for entry in list_of_entries:
    print(entry)  # to see what entry is

# (figure out why the first is empty and remove it before the loop)
# here you are on your own but try the splitlines method...
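The string methods mentioned above can be tried out on a small made-up string before you work with the real file. The sketch below is only an illustration of what split(), splitlines() and join() return; the variable names and the toy data are assumptions and not part of the assignment file.

```python
# Toy Fasta-like text (made up for illustration, not from the course file)
toy_content = ">seqA first toy entry\nACGT\nTTGA\n>seqB second toy entry\nGGCC\n"

# Splitting on '>' gives an empty string first, because the text starts with '>'
pieces = toy_content.split(">")
print(pieces)

# splitlines() breaks one entry into its header line and its sequence lines
lines = pieces[1].splitlines()
print(lines)

# join() glues a list of strings together, here with nothing in between
joined = "".join(["ACGT", "TTGA"])
print(joined)
```

Printing the intermediate values like this makes it easier to see which list element holds the header and which hold the sequence.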

Still, there are many other, and better, ways to do it. You can think about another way once you are done with the large exercise this week.

 

Send the file with your Python code to Dan no later than Tuesday (24/11) at 12.00 (not 24.00).
