Munch Lab

SAP version 1.9.3

I have brushed off SAP once again to get rid of the bit rot that sneaks into any software that relies on external services. Bugs have been fixed and new functionality has been implemented. Three things are worth mentioning:

Most importantly, SAP can now compile a local database for you, so you do not have to run BLAST and retrieve information from NCBI. This makes SAP much faster and, if you run SAP all the time as part of a pipeline, it keeps you from getting blacklisted at NCBI. To compile a database, you give an Entrez query that selects the part of GenBank you want in your local database. For example, 'COI[Gene Name] AND Aves[ORGN]' will get you a database with all the bird COI genes. You run it like this:

sap --compile 'COI[Gene Name] AND Aves[ORGN]' --database Aves_COI.fasta

Here the argument to the --database option is the name of the special FASTA file that is generated. To run SAP against the local database, use:

sap --database Aves_COI.fasta query.fasta

where query.fasta is the FASTA file holding the sequences you want to assign.
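If SAP runs as one step of a pipeline, the two commands above can be assembled programmatically. The following is a minimal Python sketch, assuming `sap` is on the PATH; the function names `compile_command`, `assign_command` and `run` are hypothetical helpers for illustration and are not part of SAP. Only the --compile and --database options shown above are used.

```python
import subprocess


def compile_command(entrez_query, database_path):
    """Argument list for compiling a local database from an Entrez query."""
    return ["sap", "--compile", entrez_query, "--database", database_path]


def assign_command(database_path, query_fasta):
    """Argument list for assigning the sequences in query_fasta
    against a previously compiled local database."""
    return ["sap", "--database", database_path, query_fasta]


def run(command):
    """Run a command built above and return its exit status.
    Hypothetical helper; assumes the `sap` executable is installed."""
    return subprocess.call(command)
```

Passing the arguments as a list means the Entrez query needs no shell quoting, so a call such as `run(assign_command("Aves_COI.fasta", "query.fasta"))` is not affected by which quote characters a shell would accept.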

Another new feature is a cleaner table of results, which is also available as a CSV file for importing into Excel. To get the latest version of SAP you must clone it from GitHub:

git clone https://github.com/kaspermunch/sap.git

Lastly, since all browsers now support SVG, SVG rendering of assignment summary trees is now the default.

11 thoughts on “SAP version 1.9.3”

  1. Hi Kasper
    I’m trying to construct a local reference for SAP. The headers of my references look like this:
    >0; genus: Acipenser, species: Acipenser brevirostrum; Acipenser_brevirostrum
    >1; genus: Acipenser, species: Acipenser medirostris; Acipenser_medirostris

    >154; genus: Pimephales, species: Pimephales promelas; Pimephales_promelas

    SAP spit out an error like this:
    sap --database /Home/References/Inhouse-references/Ref12S.rename.txt --project local 12S_otusn.fasta
    Checking cache for deprecated entries

    Loading database… done
    12s_otusn -> OTU_1:
    Retrieval of homologs:
    Entry status: (c)=cached, (d)=downloaded, (l)=local
    Error types:
    (!D)=Download error, (!?)=Unknown error
    (!T)=Taxonomic annotation problem

    Using cached Blast results… done.
    154;(l)
    ## SAP crashed, sorry ###################################################
    Help creating a more stable program by sending all the debugging information
    between the lines and your SAP version number to kaspermunch@gmail.com along
    with *.sap file in the project folder and the sequence input file used.

    File "/Home/SAP/sap-master-1.9.3/lib64/python2.6/site-packages/SAP-1.9.3-py2.6-linux-x86_64.egg/SAP/ConsoleScripts.py", line 254, in sap
    homologyResult = homolcompiler.compileHomologueSet(fastaRecord, fastaFileBaseName)
    File "/Home/SAP/sap-master-1.9.3/lib64/python2.6/site-packages/SAP-1.9.3-py2.6-linux-x86_64.egg/SAP/Homology.py", line 333, in compileHomologueSet
    alignedHomol = str(alignment.matrix[gi])

    u'154;'
    #########################################################################

    I wonder if it’s the wrong headers of my sequences that cause this problem? Would you give me any advice on this? Thank you in advance.

    1. Hi Yiyuan Li,

      You are using the wrong kind of quotes. It should be the ASCII ones ( ' and not ’ ). And you should put a space on either side of your ';'. Hope it helps.
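The spacing fix suggested in this reply can be applied to a whole reference file with a few lines of Python. This is a minimal sketch, assuming that the only change needed is a single space on either side of every ';' in the header lines; `normalize_header` is a hypothetical helper and not part of SAP, and whether SAP places further requirements on the header fields is not covered here.

```python
import re


def normalize_header(line):
    """Return a FASTA header with exactly one space on either side of each ';'.
    Lines that are not headers (sequence lines) are returned stripped but unchanged."""
    line = line.strip()
    if not line.startswith(">"):
        return line
    # Collapse any whitespace around ';' into a single space on both sides.
    return re.sub(r"\s*;\s*", " ; ", line).strip()


example = ">0; genus: Acipenser, species: Acipenser brevirostrum; Acipenser_brevirostrum"
fixed = normalize_header(example)
```

Applying the function to every line of the reference file and writing the result to a new file leaves the sequences untouched and changes only the separators in the headers.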

  2. Hi Kasper,
    We’re trying to run the compile command, but we get a crashing error. Subsequent attempts to run the compile command also result in the crash. It looks like GenBank is cutting off access?

    For us, a simple workaround would be an example file of the reference dataset. We have our own datasets, and we just want to format them properly.

    Thank you in advance!

    doug

    [b042@login00 ~]$ interactive
    Job is submitted to queue .
    <>
    <>
    [b042@cn107 ~]$ module load sap/1.9.3
    [b042@cn107 ~]$
    [b042@cn107 ~]$ sap --compile 'COI[Gene Name] AND Aves[ORGN]' --database Aves_COI.fasta
    Query: COI[Gene Name] AND Aves[ORGN]
    total nr of entries for download: 21743. Proceed? yes/no: yes
    Downloading
    [=====================================99%====================================> ]
    ## SAP crashed, sorry ###################################################
    Help creating a more stable program by sending all the debugging information
    between the lines and your SAP version number to kaspermunch@gmail.com along
    with *.sap file in the project folder and the sequence input file used.

    File "/gpfs/grace/python-2.7.6/lib/python2.7/site-packages/SAP-1.9.3-py2.7-linux-x86_64.egg/SAP/ConsoleScripts.py", line 51, in sap
    compileDatabase(options.compile, options.email, options.database)
    File "/gpfs/grace/python-2.7.6/lib/python2.7/site-packages/SAP-1.9.3-py2.7-linux-x86_64.egg/SAP/CompileDatabase.py", line 243, in compileDatabase
    taxid2gi, not_downloaded = retrieve_sequence_records(query_key, webenv, count, temp)
    File "/gpfs/grace/python-2.7.6/lib/python2.7/site-packages/SAP-1.9.3-py2.7-linux-x86_64.egg/SAP/CompileDatabase.py", line 78, in retrieve_sequence_records
    webenv=webenv, query_key=query_key)
    File "/gpfs/grace/python-2.7.6/lib/python2.7/site-packages/SAP-1.9.3-py2.7-linux-x86_64.egg/SAP/Bio/Entrez/__init__.py", line 149, in efetch
    return _open(cgi, variables, post)
    File "/gpfs/grace/python-2.7.6/lib/python2.7/site-packages/SAP-1.9.3-py2.7-linux-x86_64.egg/SAP/Bio/Entrez/__init__.py", line 462, in _open
    handle = _urlopen(cgi)
    File "/gpfs/grace/python-2.7.6/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
    File "/gpfs/grace/python-2.7.6/lib/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
    File "/gpfs/grace/python-2.7.6/lib/python2.7/urllib2.py", line 422, in _open
    '_open', req)
    File "/gpfs/grace/python-2.7.6/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
    File "/gpfs/grace/python-2.7.6/lib/python2.7/urllib2.py", line 1214, in http_open
    return self.do_open(httplib.HTTPConnection, req)
    File "/gpfs/grace/python-2.7.6/lib/python2.7/urllib2.py", line 1184, in do_open
    raise URLError(err)

    #########################################################################

    1. It looks strange. It would be nice of you to let me know if the problem persists. Here is an example of the database format produced by the new compile command:

      >403488285 ; superkingdom: Eukaryota, kingdom: Metazoa, phylum: Chordata, subphylum: Craniata, class: Aves, superorder: Neognathae, order: Passeriformes, superfamily: Passeroidea, family: Passeridae, genus: Passer ; Passer montanus
      gtccttgtagcttataaaaagcatgacactgaagatgtcaagatggctgccacacacacccaaggacaaaagacttagtcctaaccttactgttagtttttgctaggtatatacatgcaagtatccgcgctccagtgtagacgccctggacaccttaactcaggtagataggagcag
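For anyone formatting their own reference set, it can help to check each header against the example above before running SAP. The sketch below assumes, based only on that single example, that a header has three fields separated by ' ; ' (an identifier, a comma-separated list of 'rank: name' pairs, and the organism name). `parse_header` is a hypothetical helper, not SAP's own parser, and SAP may accept or require more than this.

```python
def parse_header(header):
    """Split a header of the assumed form '>ID ; rank: name, ... ; Organism name'
    into (identifier, [(rank, name), ...], organism)."""
    fields = [field.strip() for field in header.strip().lstrip(">").split(" ; ")]
    if len(fields) != 3:
        raise ValueError("expected 3 fields separated by ' ; ', got %d" % len(fields))
    identifier, taxonomy_field, organism = fields
    taxonomy = []
    for pair in taxonomy_field.split(","):
        rank, _, name = pair.partition(":")
        taxonomy.append((rank.strip(), name.strip()))
    return identifier, taxonomy, organism


identifier, taxonomy, organism = parse_header(
    ">403488285 ; class: Aves, family: Passeridae, genus: Passer ; Passer montanus"
)
```

A header that lacks the spaces around ';' fails this check with a ValueError, which makes such lines easy to find in a large file.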

  3. Hi Kasper
    I have a question about the homolog search in SAP. What is the default number of homologs that SAP retrieves? Also, is there a default setting for --individuals INDIVIDUALS?

    Thank you!
    yy

    1. The options for compiling the set of homologs have the following defaults: --besthits 30 --alignmentlimit 50 --individuals 1 --phyla 2 --classes 3 --orders 5 --families 6 --genera 10 --minimaltaxonomy 5

      1. For --individuals 1, the manual says “Number of best matching individuals”.

        Does that mean that if there are several identical sequences of the same species, SAP will take just one of them? And if there are several haplotypes of the same species, will it take all the haplotypes?

        yy

  4. Hi Kasper,
    I’m trying to set --minidentity 0.97, as that is the threshold of my OTU clustering. If my understanding is correct, the --minidentity flag uses a sequence identity similar to the one I get from web-based NCBI BLAST. But the SAP homolog search still returns homologs below this threshold, such as 91%, 93% and 94%. Would you give me any suggestions on this? Thank you in advance.

    yy

    1. Hi Yiyuan,

      The --minidentity option specifies the “Minimum global alignment similarity of best blast hit”. So it is the minimal accepted identity in the global clustalw alignment of the query and the homolog identified using blast. It does not refer to the identity of the local alignments found by blast. Hope this helps.

      K
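To illustrate the distinction made in this reply, the sketch below computes identity over a full (global) pairwise alignment and over only a well-matching (local) stretch of it. It assumes one common convention, identical columns divided by alignment length with gap columns counted as mismatches; how clustalw, blast or SAP compute their reported values may differ, and the two toy sequences are made up for illustration.

```python
def identity(aligned_a, aligned_b):
    """Fraction of alignment columns in which both sequences carry the same
    residue. Columns containing a gap ('-') count as mismatches."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    return matches / float(len(aligned_a))


# Toy global alignment: the middle matches perfectly, the ends do not.
query = "ACGTACGTACGT"
homolog = "TTGTACGTAC--"

global_identity = identity(query, homolog)            # all 12 columns
local_identity = identity(query[2:10], homolog[2:10])  # only the matching stretch
```

Here the local stretch is a perfect match while the identity over the whole alignment is lower, so the same pair of sequences can sit on different sides of a threshold depending on which of the two measures is used.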

      1. Kasper,
        I put one test sequence into SAP. When I BLAST it against the NCBI nr database, only 3 sequences have identity > 97%. All of them belong to Lepomis macrochirus. My expectation is that the sequence should be assigned to Lepomis macrochirus with posterior probability 1.0. But when I read the output of SAP, it included not only the sequences > 97% on NCBI but also other species. As a result, the posterior probability is only 0.006. I wonder if I misinterpreted the --minidentity flag here?

        The sequence I’m using is:
        >OTU_7
        AGAGGCTCAAGTTGATGAACCCCGGCGTAAAGAGTGGTTAAGGGAGATCAAAACTAAAGCCGAATGCTTTCAAAGCTGTTATACGCTTCCGAAAGTAAGAAGCCCAATCACGAAAGTGGCTTTACTTTACCTGACCCCACGAAAGCTACGACACAAACTGGGATTAGATACCCCACTATGCCTAGCCTTAAACATTGGCAACACTTTACACCTGCTGCCCGCCAGGAAACTACGAGCATTA

        Thank you for any advice!

        yy
