Ulrich, L.E. and Zhulin, I.B. Bioinformatics (2014)

SeqDepot

Introduction

SeqDepot was born out of a need to easily and rapidly access precomputed data for protein sequences. Moreover, it is computationally wasteful to repeatedly analyze identical sequences that yield the same results. Many useful resources with similar goals already exist, so why build another one?

  1. Inability to add new sequences: we need to incorporate novel sequences on demand to support downstream projects (processing new genomes into MiST)
  2. Limited precomputed data: again, we have various projects that require additional predictive data not supported by or available from third-party resources
  3. Inadequate query API: poorly implemented, external dependencies, slow, unable to query large numbers of sequences, and importantly, not easily consumable (i.e. requires significant programming to utilize)

Standing on the shoulders of giants

Rather than starting from scratch, SeqDepot builds on the value already provided by great projects like SIMAP, which is the source for a large portion of the core SeqDepot database. A flexible computational pipeline, in conjunction with access to a high-performance compute cluster, enables us to seamlessly add datasets derived with custom tools. A well-documented and powerful RESTful API gives the entire scientific community complete access and returns easily consumable JSON responses.

Infrastructure

At its core, the SeqDepot database is quite simple. A central MongoDB database instance stores protein sequences, a variety of precomputed feature data, and various cross-references to external databases. Sequences and annotations are merged from the SIMAP project, UniProt, the NCBI non-redundant database, and the PDB database. All sequences are uniquely identified by Aseq IDs. A few more features are then predicted on a local compute cluster. All information is exposed via a RESTful API.

Update process

  1. Download all features and sequences from the SIMAP FTP site
  2. Download the latest PDB database
  3. Download the latest NR database
  4. Extract UniProt identifiers from the SIMAP sequences file
  5. All sequences, features, and cross-references are then merged into a single master JSON file
  6. The aggregate JSON data file is then merged into the existing SeqDepot database
  7. All novel sequences are then further analyzed on the Newton high-performance compute cluster:

System details

Hardware
  • (4) Intel 64-bit X5647 quad core processors
  • 32 GB RAM
  • 1 TB SSD hard drive space
Software

Intrinsic identifiers and the Aseq ID

Sequence identification is vital to any project involving DNA or protein sequences. The large number of available sequence databases inevitably results in a huge number of proprietary identifiers. Mapping sequences from one database to another involves extensive cross-referencing. The problem is so widespread that various services such as the Protein Identifier Cross-Reference (PICR) have been developed to cross-reference more than 100 distinct databases.

Ideally, each sequence could be identified using only its sequence characters - in other words, an intrinsic identifier. Several algorithms from the cryptographic community do just this with varying degrees of success. For example, digesting the sequence with the MD5 hashing algorithm results in a short character string that uniquely identifies the sequence - any change, even to a single character, will produce a different digest. The only weakness of such an approach is the possibility of "collisions" - two distinct sequences that produce the same digest; however, this risk is negligible because a collision is very unlikely to occur and no collisions have yet been found in all currently known sequence databases. Stronger algorithms such as SHA-1 and SHA-2 are less likely to collide; however, they also produce longer character strings, which demand more memory when populating a database engine.
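
To make the size trade-off concrete, the following Python sketch compares the length of the unpadded Base64 encoding of an MD5, SHA-1, and SHA-256 digest. The helper name encoded_length is an illustrative assumption and is not part of SeqDepot.

```python
import base64
import hashlib


def encoded_length(algorithm: str, sequence: str) -> int:
    """Length of the unpadded Base64 encoding of a digest (hypothetical helper)."""
    digest = hashlib.new(algorithm, sequence.encode("ascii")).digest()
    # Strip the trailing '=' padding before measuring the string.
    return len(base64.b64encode(digest).rstrip(b"="))
```

Because digest sizes are fixed, the result does not depend on the input sequence: 22 characters for MD5, 27 for SHA-1, and 43 for SHA-256.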

Aseq ID

Each sequence in the SeqDepot database is uniquely identified by an Aseq ID, which is simply derived as follows (see table below for examples):

  1. Remove all non-sequence characters from the sequence
  2. Upper-case the sequence
  3. Generate an MD5 digest
  4. Encode the result in Base64 (smaller than its hexadecimal equivalent and human readable)
  5. Remove any padding characters (usually equal signs)
  6. Convert all forward slashes and plus signs to URL "friendly" underscores and dashes, respectively

Generation of Aseq ID from raw sequences

>Tar
MINRIRVVTLLVMVLGVFALLQLISGSLF
FSSLHHSQKSFVVSNQLREQQGELTSTWD
LMLQASTALNKAGTLTALSYPADDIKTLM
...
  Base64 encoded MD5: /wXylOa/eoFtpjBR9hTF2A==
  Aseq ID:            _wXylOa_eoFtpjBR9hTF2A

>CheW
MTGMTNVTKLASEPSGQEFLVFTLGDEEY
GIDILKVQEIRGYDQVTRIANTPAFIKGV
TNLRGVIVPIVDLRIKFSQVDVDYNDNTV
VIVLNLGQRVVGIVVDGVSDVLSLTAEQI
RPAPEFAVTLSTEYLTGLGALGDRMLILV
NIEKLLNSEEMALLDSAASEVA
  Base64 encoded MD5: /VZVW/e+iEqzExAuqrwEhQ==
  Aseq ID:            _VZVW_e-iEqzExAuqrwEhQ

>CheA
MSMDISDFYQTFFDEADELLADMEQHLLV
LQPEAPDAEQLNAIFRAAHSIKGGAGTFG
FSVLQETTHLMENLLDEARRGEMQLNTDI
INLFLETKDIMQEQLDAYKQSQEPDAASF
DYIC...
  Base64 encoded MD5: fiUs+3vh34LxGVAdbheipg==
  Aseq ID:            fiUs-3vh34LxGVAdbheipg

The result is an Aseq ID: a source-agnostic, 22-character string derived solely from the sequence characters that uniquely identifies a specific amino acid sequence (note: this approach may easily be applied to DNA / RNA sequences as well). Aseq IDs circumvent the need to maintain various cross-references to external databases and simplify sequence identification. Moreover, an Aseq ID is small enough to be readily indexed by database systems without consuming large amounts of RAM.
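
The six derivation steps can be sketched in Python as follows. This is a minimal illustration rather than the SeqDepot implementation: the function names are hypothetical, and "non-sequence characters" is assumed here to mean anything that is not a letter.

```python
import base64
import hashlib
import re


def aseq_id_from_digest(digest: bytes) -> str:
    """Steps 4-6: Base64-encode, drop padding, use URL-friendly characters."""
    # urlsafe_b64encode already maps '+' to '-' and '/' to '_'.
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


def aseq_id_from_sequence(sequence: str) -> str:
    """Steps 1-6 applied to a raw sequence string."""
    # Assumption: every non-letter character is a non-sequence character.
    cleaned = re.sub(r"[^A-Za-z]", "", sequence).upper()
    digest = hashlib.md5(cleaned.encode("ascii")).digest()
    return aseq_id_from_digest(digest)
```

Since an MD5 digest is always 16 bytes, the returned string is always 22 characters long, and inputs that differ only in case or whitespace map to the same identifier.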

Local installation

The complete MongoDB data files comprising SeqDepot may be downloaded and used with a local MongoDB instance. Simply follow the steps below:

  1. Requirements:
    • At least 100GB free hard drive space (preferably 150GB or more)
    • As much RAM as possible
  2. Follow these instructions to install MongoDB
  3. Download the SeqDepot Mongo Database
  4. Decompress the seqdepot tarball:
    $ tar xvf seqdepot.latest.tar.gz
  5. Start the mongodb server:
    $ cd seqdepot
    $ mongod --dbpath .
  6. Begin interacting with the database. For example, open a terminal, and try connecting with the mongo client:
    $ mongo seqdepot

sdQuery.pl

sdQuery.pl is a command line program written in Perl for easily retrieving data from the SeqDepot database. sdQuery.pl reads one or more input files of identifiers or FASTA sequences, queries SeqDepot for matches, and outputs the results. It may be easily used for one-off tasks or integrated into a computational pipeline.

Supported input data
  • FASTA
  • Aseq ID
  • GI
  • PDB ID
  • UniProt ID
  • MD5 hexadecimal digests
Output formats
  • JSON
  • PNG or SVG images (saved to separate files)
  • Tab-delimited (only applies to cross-references)

Installation

  1. Download sdQuery.pl and SeqDepot.pm to the same directory
  2. Run sdQuery.pl:
    $ ./sdQuery.pl

Usage

Run sdQuery.pl without any arguments, or with the -h flag, to view the usage:

$ ./sdQuery.pl

Usage: sdQuery.pl [options] <input file>[ <input file> ...]

  To process via STDIN, provide - in place of all <input file>.

  Available options:
  ------------------
    -h, --help                  : This help page.
    -t, --type = <string>       : Type of input data. Acceptable values are
                                  fasta, aseq_id, gi, uni (UniProt ID), pdb,
                                  or md5_hex. Will guess if not explicitly
                                  specified. If not fasta, all ids must
                                  exist on separate lines.
    -f, --fields = <string>     : List of fields to pull down from SeqDepot.
                                  By default, all available fields are
                                  requested.
    -a, --array-to-hashes       : Convert all pre-computed array data to an
                                  array of hashes.
    -o, --out = <file>          : Redirect all output to <file> instead of
                                  the console.
    -u, --outtype = <string>    : Type of data to output. Acceptable values
                                  are json, json_per_line, fasta, png, svg,
                                  or xrefs (TSV). Defaults to json. If xrefs
                                  is used, then the xrefs option (-x, --xrefs)
                                  must be specified.
    -d, --image-dir = <path>     : Directory to save images to. Applies only
                                  if the outType is png or svg. Defaults to
                                  the current directory.
    -n, --image-file-pattern = <string>
                                : Filename pattern to use when saving images
                                  to disk. The special variable ${ID} will
                                  be replaced with the query identifier. In
                                  the case of FASTA input, this is its Aseq
                                  ID. Alternatively, ${FASTA_HEADER} may be
                                  used with fasta input to use the fasta
                                  header as the base file name. Defaults to
                                  ${ID}.
    -x, --xrefs = <string>      : One or more comma-separated database names
                                  to cross-reference. Acceptable values are
                                  gi, uni, and pdb. Only relevant if outtype
                                  is set to xrefs.
    -p, --pretty-json           : If outtype is json, then pretty print the
                                  results.

Additional option details

-t, --type = <string>

If set to fasta, sdQuery generates the Aseq ID corresponding to each sequence and uses this to find matching records. No header data is used for identification purposes.

Otherwise, if the input consists of identifiers, each identifier must be placed on its own line, at the beginning of that line.
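
As an illustration of the fasta case, the Python sketch below splits FASTA text into (header, sequence) pairs, which is the step that must precede deriving an Aseq ID for each sequence. The function parse_fasta is a hypothetical helper written for this example and is not the parser used by sdQuery.pl.

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        yield header, "".join(chunks)
```

The header is returned only so results can be labeled; as noted above, it plays no role in identifying the sequence.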

-f, --fields = <string>

Comma-separated field string (see schema). Secondary fields must be listed in parentheses, separated by pipe (|) symbols, immediately after the respective primary field name. For example:

s,l                        Returns the sequence and length primary fields
l,t,x                      Returns the length, all tool data, and all cross-references
l,t(pfam26|smart)          Returns the length primary field, and pfam and smart secondary fields of t
l,t(pfam26|smart),x(gi)    Same as the above but include any GI identifiers

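When field strings are assembled programmatically, a small helper can keep the parentheses and separators consistent. The Python sketch below is an assumption-laden convenience (build_fields is not part of sdQuery.pl or the SeqDepot modules); it only produces strings in the syntax described above.

```python
from typing import Mapping, Optional, Sequence


def build_fields(spec: Mapping[str, Optional[Sequence[str]]]) -> str:
    """Build a field string from {primary: [secondary, ...] or None}."""
    parts = []
    for primary, secondary in spec.items():
        if secondary:
            # Secondary fields go in parentheses, separated by pipes.
            parts.append(f"{primary}({'|'.join(secondary)})")
        else:
            parts.append(primary)
    return ",".join(parts)
```

For example, build_fields({"l": None, "t": ["pfam26", "smart"], "x": ["gi"]}) yields the last field string listed above.
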
-a, --array-to-hashes

This option transforms any requested precomputed data from an array of arrays into an array of hashes when the output type is either json or json_per_line.

To reduce the required storage space, all precomputed data is stored as an array of arrays (typically, an analytical tool may produce multiple results per sequence, and each result may have multiple attributes). For example, the DAS transmembrane prediction tool may predict several transmembrane regions with each described by five fields (start, stop, peak, peak_score, and evalue). Here is a partial record with DAS results without using the -a, --array-to-hashes option:

{
    "l": 894,
    "id": "naytI0dLM_rK2kaC1m3ZSQ",
    "t": {
        "das": [
            [
                403,
                423,
                411,
                4.116,
                0.0006308
            ],
            [
                425,
                445,
                434,
                5.243,
                1.185e-5
            ], ...

With the -a, --array-to-hashes option, the above would look like:

{
    "l": 894,
    "id": "naytI0dLM_rK2kaC1m3ZSQ",
    "t": {
        "das": [
            {"start": 403,
             "stop": 423,
             "peak": 411,
             "peak_score": 4.116,
             "evalue": 0.0006308
            },
            {"start": 425,
             "stop": 445,
             "peak": 434,
             "peak_score": 5.243,
             "evalue": 1.185e-5
            }, ...
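
The same transformation can be sketched client-side in Python. The field names for DAS are the five listed above; the helper label_rows is hypothetical and simply pairs each value with its field name.

```python
DAS_FIELDS = ("start", "stop", "peak", "peak_score", "evalue")


def label_rows(rows, field_names):
    """Convert an array of arrays into a list of dictionaries."""
    return [dict(zip(field_names, row)) for row in rows]


# The two DAS results from the partial record shown above.
das = [
    [403, 423, 411, 4.116, 0.0006308],
    [425, 445, 434, 5.243, 1.185e-5],
]
labeled = label_rows(das, DAS_FIELDS)
```
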
-u, --outtype = <string>

If fasta, then each header line will be terminated with a numeric HTTP status code (200 if sequence exists in SeqDepot, 404 if not found, or 500 if a server error occurred). If the status is 200, a JSON string containing the requested data will also be appended.

If json_per_line, then each result (for each query) will be returned as an independent JSON-encoded object on its own line. This is useful when dealing with many sequences because parsers can begin processing after each line has been read. In contrast, the json output type returns a single JSON-encoded object, and the entire response must be received before any processing can begin.

If png or svg, then an image file will be created in the current directory (or the directory specified by -d, --image-dir) for each matching sequence in SeqDepot.
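
The benefit of json_per_line is easiest to see in a consumer that handles one result at a time. The Python sketch below assumes each line has the same shape as one element of the json output (query, data, and code keys); the sample lines, including the shape of the 404 record, are made up for illustration.

```python
import io
import json


def iter_results(stream):
    """Yield one decoded result per non-empty line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Stand-in for sdQuery.pl output; a real stream would be a file or pipe.
sample = io.StringIO(
    '{"query": "300937843", "data": {"l": 225}, "code": 200}\n'
    '{"query": "-2345324", "code": 404}\n'
)
found = [r["query"] for r in iter_results(sample) if r["code"] == 200]
```

Because iter_results is a generator, each record can be processed and discarded before the next line is read.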

sdQuery Examples

The example commands reference the following input files:

  • gis.txt
    $ cat gis.txt
    300937843
    1346374
  • uni.txt
    $ cat uni.txt
    
    F5CGH2_9HIV1
  • seqs.faa
    $ cat seqs.faa
    
    >accession:NP_415222.1|locus:b0694|genome:Escherichia coli str. K-12 substr. MG1655
    MTNVLIVEDEQAIRRFLRTALEGDGMRVFEAETLQRGLLEAATRKPDLIILDLGLPDGDG
    IEFIRDLRQWSAVPVIVLSARSEESDKIAALDAGADDYLSKPFGIGELQARLRVALRRHS
    ATTAPDPLVKFSDVTVDLAARVIHRGEEEVHLTPIEFRLLAVLLNNAGKVLTQRQLLNQV
    WGPNAVEHSHYLRIYMGHLRQKLEQDPARPRHFITETGIGYRFML
    
    >b0695 [Escherichia coli str. K-12 substr. MG1655]
    MNNEPLRPDPDRLLEQTAAPHRGKLKVFFGACAGVGKTWAMLAEAQRLRAQGLDIVVGVV
    ETHGRKDTAAMLEGLAVLPLKRQAYRGRHISEFDLDAALARRPALILMDELAHSNAPGSR
    HPKRWQDIEELLEAGIDVFTTVNVQHLESLNDVVSGVTGIQVRETVPDPFFDAADDVVLV
    DLPPDDLRQRLKEGKVYIAGQAERAIEHFFRKGNLIALRELALRRTADRVDEQMRAWRGH
    PGEEKVWHTRDAILLCIGHNTGSEKLVRAAARLASRLGSVWHAVYVETPALHRLPEKKRR
    AILSALRLAQELGAETATLSDPAEEKAVVRYAREHNLGKIILGRPASRRWWRRETFADRL
    ARIAPDLDQVLVALDEPPARTINNAPDNRSFKDKWRVQIQGCVVAAALCAVITLIAMQWL
    MAFDAANLVMLYLLGVVVVALFYGRWPSVVATVINVVSFDLFFIAPRGTLAVSDVQYLLT
    FAVMLTVGLVIGNLTAGVRYQARVARYREQRTRHLYEMSKALAVGRSPQDIAATSEQFIA
    STFHARSQVLLPDDNGKLQPLTHPQGMTPWDDAIAQWSFDKGLPAGAGTDTLPGVPYQIL
    PLKSGEKTYGLVVVEPGNLRQLMIPEQQRLLETFTLLVANALERLTLTASEEQARMASER
    EQIRNALLAALSHDLRTPLTVLFGQAEILTLDLASEGSPHARQASEIRQHVLNTTRLVNN
    LLDMARIQSGGFNLKKEWLTLEEVVGSALQMLEPGLSSPINLSLPEPLTLIHVDGPLFER
    VLINLLENAVKYAGAQAEIGIDAHVEGENLQLDVWDNGPGLPPGQEQTIFDKFARGNKES
    AVPGVGLGLAICRAIVDVHGGTITAFNRPEGGACFRVTLPQQTAPELEEFHEDM
    
    >KdpC
    MSGLRPALSTFIFLLLITGGVYPLLTTVLGQWWFPWQANGSLIREGDTVRGSALIGQNFT
    GNGYFHGRPSATAEMPYNPQASGGSNLAVSNPELDKLIAARVAALRAANPDASASVPVEL
    VTASASGLDNNITPQAAAWQIPRVAKARNLSVEQLTQLIAKYSQQPLVKYIGQPVVNIVE
    LNLALDKLDE
    
  • seqs-named.faa
    $ cat seqs-named.faa
    
    >b0695
    MNNEPLRPDPDRLLEQTAAPHRGKLKVFFGACAGVGKTWAMLAEAQRLRAQGLDIVVGVV
    ETHGRKDTAAMLEGLAVLPLKRQAYRGRHISEFDLDAALARRPALILMDELAHSNAPGSR
    HPKRWQDIEELLEAGIDVFTTVNVQHLESLNDVVSGVTGIQVRETVPDPFFDAADDVVLV
    DLPPDDLRQRLKEGKVYIAGQAERAIEHFFRKGNLIALRELALRRTADRVDEQMRAWRGH
    PGEEKVWHTRDAILLCIGHNTGSEKLVRAAARLASRLGSVWHAVYVETPALHRLPEKKRR
    AILSALRLAQELGAETATLSDPAEEKAVVRYAREHNLGKIILGRPASRRWWRRETFADRL
    ARIAPDLDQVLVALDEPPARTINNAPDNRSFKDKWRVQIQGCVVAAALCAVITLIAMQWL
    MAFDAANLVMLYLLGVVVVALFYGRWPSVVATVINVVSFDLFFIAPRGTLAVSDVQYLLT
    FAVMLTVGLVIGNLTAGVRYQARVARYREQRTRHLYEMSKALAVGRSPQDIAATSEQFIA
    STFHARSQVLLPDDNGKLQPLTHPQGMTPWDDAIAQWSFDKGLPAGAGTDTLPGVPYQIL
    PLKSGEKTYGLVVVEPGNLRQLMIPEQQRLLETFTLLVANALERLTLTASEEQARMASER
    EQIRNALLAALSHDLRTPLTVLFGQAEILTLDLASEGSPHARQASEIRQHVLNTTRLVNN
    LLDMARIQSGGFNLKKEWLTLEEVVGSALQMLEPGLSSPINLSLPEPLTLIHVDGPLFER
    VLINLLENAVKYAGAQAEIGIDAHVEGENLQLDVWDNGPGLPPGQEQTIFDKFARGNKES
    AVPGVGLGLAICRAIVDVHGGTITAFNRPEGGACFRVTLPQQTAPELEEFHEDM
    
    >KdpC
    MSGLRPALSTFIFLLLITGGVYPLLTTVLGQWWFPWQANGSLIREGDTVRGSALIGQNFT
    GNGYFHGRPSATAEMPYNPQASGGSNLAVSNPELDKLIAARVAALRAANPDASASVPVEL
    VTASASGLDNNITPQAAAWQIPRVAKARNLSVEQLTQLIAKYSQQPLVKYIGQPVVNIVE
    LNLALDKLDE
    
Fetch complete JSON record for GI identifiers
$ ./sdQuery.pl gis.txt

[{"query":"300937843","data":{"l":225,"_s":"TdddTdT-TddTTdTTddd","id":"yg8A8H8N-4x1Ezf8WW-YbA","x":{"gi":[16128670,170080361,170682921,188495689,218699050,238899960,300937843,300951198,300959271,301028821,301645940,312970765,331641191,386279706,386596462,387611184,387620427,388476786,404374022,415776911,417128818,417263781,417274153,417275078,417289965,417611706,417617084,417633157,417946768,417978416,418301546,418959019,419141211,419146606,419152564,419158008,419162933,419813292,422765222,422791470,422816668,422827886,423701438,425114035,425118795,425271370,425282045,432415614,432562567,432579346,432601223,432626240,432635967,432659921,432679116,432684496,432690585,432703233,432736199,432880151,432953818,433046822,442595712,2507374,1786911,85674742,169888196,170520639,188490888,218369036,238860865,260450151,299878183,300314143,300449540,300457127,301075799,309700920,310337414,315135350,315616391,323938337,323972049,331037989,339413644,342361480,344191917,345366191,345380958,345390827,359331399,371616312,377999426,378001534,378003302,378013164,378016324,384378190,385153832,385540141,385712792,386123258,386143774,386222669,386232581,386241731,386256003,404292509,408198433,408205813,408572529,408573073,430943990,431099800,431109048,431143435,431165036,431174249,431203284,431224514,431224622,431230497,431246723,431286103,431413775,431470314,431571450,441604263],"uni":["B1LLD8_ECOSM","B1X6M5_ECODH","B2N7S1_ECOLX","B7NMP7_ECO7I","C4ZWH0_ECOBW","C9R0S5_ECOD1","D8A890_ECOLX","D8AUM7_ECOLX","D8BAH1_ECOLX","D8C3V7_ECOLX","E1HN88_ECOLX","E2WSA7_ECOLX","E3PH87_ECOH1","E6B6M5_ECOLX","E9WC57_ECOLX","E9YE81_ECOLX","F4SKQ2_ECOLX","F9R4B0_ECOLX","G0FEX8_ECOLX","G2ARB8_ECOLX","G2B6K4_ECOLX","G2CGW5_ECOLX","G2F843_ECOLX","H0QDN9_ECOLI","H1DN11_ECOLX","H4UG14_ECOLX","H4UWV4_ECOLX","H4VCZ8_ECOLX","H4VTC8_ECOLX","H4W7A9_ECOLX","I0ZPA7_ECOLX","I2HYP8_ECOLX","I2PSR2_ECOLX","I2R5Q3_9ESCH","I2RQI4_ECOLX","I2X7W8_ECOLX","I2Y0L1_ECOLX","I2YRK8_ECOLX","I2ZXA6_ECOLX","I4JCK9_ECOLX","KDPE_ECOLI"
]},"s":"MTNVLIVEDEQAIRRFLRTALEGDGMRVFEAETLQRGLLEAATRKPDLIILDLGLPDGDGIEFIRDLRQWSAVPVIVLSARSEESDKIAALDAGADDYLSKPFGIGELQARLRVALRRHSATTAPDPLVKFSDVTVDLAARVIHRGEEEVHLTPIEFRLLAVLLNNAGKVLTQRQLLNQVWGPNAVEHSHYLRIYMGHLRQKLEQDPARPRHFITETGIGYRFML","t":{"gene3d":[["3.40.50.2300","",1,122,2.2e-36],["1.10.10.10","winged helix repressor DNA binding domain",126,223,8.1e-28]],"superfam":[["SSF52172","CheY-like",1,189,7.8e-41]],"segs":[[46,61]],"pfam26":[["Response_reg",4,112,"..",0.008,1,111,"[.",4,113,"..",101.91,6.865e-32,1.248e-29,0.982],["Trans_reg_C",148,223,"..",0.026,2,77,".]",146,223,"..",77.617,6.415e-25,3.207e-22,0.973]],"agfam1":[["RR",3,118,"..",1,122,"[]",126.207,2.648e-37]],"smart":[["SM00448","REC",2,112,2.4e-40],["SM00862","Trans_reg_C",147,223,2.1e-22]],"proscan":[["PS50110","RESPONSE_REGULATORY",3,116,40.23]],"panther":[["PTHR26402",1,225,3.8e-86],["PTHR26402:SF259",1,225,3.8e-86]]}},"code":200},{"query":"1346374","data":{"l":894,"_s":"TTTdTdT-TdTTTdTTddT","id":"naytI0dLM_rK2kaC1m3ZSQ","x":{"gi":[16128671,170080362,238899961,300951199,300959272,301028820,301645939,331641192,386279707,386596461,386612864,386703866,387611185,387620428,388476787,417289371,417611707,417946767,417978415,418959018,419152565,419162934,419813291,422816669,423701439,425114036,425118796,432415615,432562568,432626241,432635968,432659922,432684497,432690586,432703234,432736200,432880152,432953819,442595713,1346374,146551,1651302,1786912,169888197,238861252,260450150,299878182,300314144,300449541,301075798,309700921,315135351,331037990,332342033,342361479,344191916,345366192,359331400,378003303,378016325,383102034,384378189,385153831,385540142,385712793,386123259,386255409,408572530,408573074,430943991,431099801,431165037,431174250,431203285,431224623,431230498,431246724,431286104,431413776,431470315,441604264],"uni":["B1X6M6_ECODH","C4ZWH1_ECOBW","C9R0S4_ECOD1","D8AUM8_ECOLX","D8BAH2_ECOLX","D8C3V6_ECOLX","E1HN87_ECOLX","E3PH88_ECOH1","F4M915_ECOLX","F4SKQ3_ECOLX","F9R4A9_ECOLX","G2ARB9_ECOLX","
G2F842_ECOLX","H0QDP0_ECOLI","H4VCZ9_ECOLX","H4W7B0_ECOLX","H9UPV0_ECOLX","I0ZPA6_ECOLX","I2HYP7_ECOLX","I2PSR3_ECOLX","I2R5Q4_9ESCH","I2ZVL2_ECOLX","I4JCL0_ECOLX","KDPD_ECOLI"]},"s":"MNNEPLRPDPDRLLEQTAAPHRGKLKVFFGACAGVGKTWAMLAEAQRLRAQGLDIVVGVVETHGRKDTAAMLEGLAVLPLKRQAYRGRHISEFDLDAALARRPALILMDELAHSNAPGSRHPKRWQDIEELLEAGIDVFTTVNVQHLESLNDVVSGVTGIQVRETVPDPFFDAADDVVLVDLPPDDLRQRLKEGKVYIAGQAERAIEHFFRKGNLIALRELALRRTADRVDEQMRAWRGHPGEEKVWHTRDAILLCIGHNTGSEKLVRAAARLASRLGSVWHAVYVETPALHRLPEKKRRAILSALRLAQELGAETATLSDPAEEKAVVRYAREHNLGKIILGRPASRRWWRRETFADRLARIAPDLDQVLVALDEPPARTINNAPDNRSFKDKWRVQIQGCVVAAALCAVITLIAMQWLMAFDAANLVMLYLLGVVVVALFYGRWPSVVATVINVVSFDLFFIAPRGTLAVSDVQYLLTFAVMLTVGLVIGNLTAGVRYQARVARYREQRTRHLYEMSKALAVGRSPQDIAATSEQFIASTFHARSQVLLPDDNGKLQPLTHPQGMTPWDDAIAQWSFDKGLPAGAGTDTLPGVPYQILPLKSGEKTYGLVVVEPGNLRQLMIPEQQRLLETFTLLVANALERLTLTASEEQARMASEREQIRNALLAALSHDLRTPLTVLFGQAEILTLDLASEGSPHARQASEIRQHVLNTTRLVNNLLDMARIQSGGFNLKKEWLTLEEVVGSALQMLEPGLSSPINLSLPEPLTLIHVDGPLFERVLINLLENAVKYAGAQAEIGIDAHVEGENLQLDVWDNGPGLPPGQEQTIFDKFARGNKESAVPGVGLGLAICRAIVDVHGGTITAFNRPEGGACFRVTLPQQTAPELEEFHEDM","t":{"segs":[[94,107],[166,186],[266,277],[428,441],[711,722]],"pfam26":[["KdpD",21,230,"..",0.006,2,211,".]",20,230,"..",329.179,4.617e-103,4.617e-99,0.995],["HATPase_c",778,881,"..",0.003,6,110,"..",774,882,"..",84.573,1.307e-26,2.421e-24,0.965],["DUF4118",407,499,"..",9.917,5,103,"..",402,501,"..",54.331,1.598e-18,7.991e-15,0.836],["HisKA",664,730,"..",1.214,2,68,".]",663,730,"..",43.184,6.792e-14,1.887e-11,0.878],["GAF_3",528,644,"..",0.002,2,129,".]",527,644,"..",38.631,4.443e-13,6.347e-10,0.855],["Usp",251,365,"..",0.429,3,133,"..",249,373,"..",21.742,1.263e-07,0.0001149,0.847]],"agfam1":[["HK_CA:13",730,881,"..",1,158,"[]",175.069,5.176e-52],["HK_CA:2",730,881,"..",1,161,"[]",123.442,1.8e-36],["HK_CA:5",737,881,"..",1,144,"[]",108.901,4.29e-32]],"das":[[403,423,411,4.116,0.0006308],[425,445,434,5.243,1.185e-05],[448,464,456,3.252,0.01334],[476,493,485,4.305,0.0003238],[850,851,851,2.544,0.1621]],"pan
ther":[["PTHR24423",22,894,1.3e-119],["PTHR24423:SF357",22,894,1.3e-119]],"gene3d":[["3.40.50.300","P-loop containing nucleotide triphosphate hydrolases",21,229,1.2e-100],["3.40.50.620","Tyrosyl-Transfer RNA Synthetase ; subunit E; domain 1",245,352,4.8e-06],["1.20.120.620","Backbone structure of the membrane domain of e. Coli histidine kinase receptor kdpd;",397,502,5.6e-39],["1.10.287.130","",657,726,6e-15],["3.30.565.10","",732,885,7.8e-38]],"superfam":[["SSF52402","Adenine nucleotide alpha hydrolases-like",248,378,5.2e-06],["SSF55781","GAF domain-like",508,659,2.8e-06],["SSF47384","Homodimeric domain of signal transducing histidine kinase",645,732,1.4e-15],["SSF55874","ATPase domain of HSP90 chaperone/DNA topoisomerase II/histidine kinase",719,893,1.5e-41]],"coils":[[642,662]],"tmhmm":[[399,421],[425,444],[449,471],[476,498]],"smart":[["SM00388","HisKA",663,730,1.4e-13],["SM00387","HATPase_c",773,883,4.9e-33]],"prints":[["PR00344","BCTRLSENSOR",810,824,1.6e-12],["PR00344","BCTRLSENSOR",828,838,1.6e-12],["PR00344","BCTRLSENSOR",843,861,1.6e-12],["PR00344","BCTRLSENSOR",867,880,1.6e-12]],"proscan":[["PS50109","HIS_KIN",670,883,45.15]]}},"code":200}]
Fetch length and Gene3D results for GI identifiers
$ ./sdQuery.pl -f "l,t(gene3d)" gis.txt

[{"query":"300937843","data":{"l":225,"id":"yg8A8H8N-4x1Ezf8WW-YbA","t":{"gene3d":[["3.40.50.2300","",1,122,2.2e-36],["1.10.10.10","winged helix repressor DNA binding domain",126,223,8.1e-28]]}},"code":200},{"query":"1346374","data":{"l":894,"id":"naytI0dLM_rK2kaC1m3ZSQ","t":{"gene3d":[["3.40.50.300","P-loop containing nucleotide triphosphate hydrolases",21,229,1.2e-100],["3.40.50.620","Tyrosyl-Transfer RNA Synthetase ; subunit E; domain 1",245,352,4.8e-06],["1.20.120.620","Backbone structure of the membrane domain of e. Coli histidine kinase receptor kdpd;",397,502,5.6e-39],["1.10.287.130","",657,726,6e-15],["3.30.565.10","",732,885,7.8e-38]]}},"code":200}]
Fetch length, smart, and panther results using FASTA
$ ./sdQuery.pl -f "l,t(smart|panther)" seqs.faa

[{"query":"yg8A8H8N-4x1Ezf8WW-YbA","data":{"l":225,"id":"yg8A8H8N-4x1Ezf8WW-YbA","t":{"smart":[["SM00448","REC",2,112,2.4e-40],["SM00862","Trans_reg_C",147,223,2.1e-22]],"panther":[["PTHR26402",1,225,3.8e-86],["PTHR26402:SF259",1,225,3.8e-86]]}},"header":"accession:NP_415222.1|locus:b0694|genome:Escherichia coli str. K-12 substr. MG1655","code":200},{"query":"naytI0dLM_rK2kaC1m3ZSQ","data":{"l":894,"id":"naytI0dLM_rK2kaC1m3ZSQ","t":{"smart":[["SM00388","HisKA",663,730,1.4e-13],["SM00387","HATPase_c",773,883,4.9e-33]],"panther":[["PTHR24423",22,894,1.3e-119],["PTHR24423:SF357",22,894,1.3e-119]]}},"header":"b0695 [Escherichia coli str. K-12 substr. MG1655]","code":200},{"query":"GS8z3QwN5MzpxU0aTuxuaA","data":{"l":190,"id":"GS8z3QwN5MzpxU0aTuxuaA","t":{"panther":[["PTHR30042",1,190,9.7e-95],["PTHR30042:SF0",1,190,9.7e-95]]}},"header":"KdpC","code":200}]
Fetch length, smart, and panther results using FASTA and convert arrays to hashes
$ ./sdQuery.pl -f "l,t(smart|panther)" -p -a seqs.faa

            
Timeout error (e.g. the SeqDepot server is down)
$ ./sdQuery.pl gis.txt

Requesting batch from SeqDepot...

Unable to connect to server; timeout or other internal error
Download PNG domain architecture visualizations for a list of GI numbers
$ ./sdQuery.pl -u png gis.txt

// Creates PNG file: 300937843.png
// Creates PNG file: 1346374.png
Download SVG domain architecture visualizations for FASTA sequences

Note: each file name consists of the sequence's Aseq ID because this is the default identifier used when querying with FASTA sequences. See the next example for an alternative naming scheme.

$ ./sdQuery.pl -u svg seqs.faa

// Creates SVG file: yg8A8H8N-4x1Ezf8WW-YbA.svg
// Creates SVG file: naytI0dLM_rK2kaC1m3ZSQ.svg
// Creates SVG file: GS8z3QwN5MzpxU0aTuxuaA.svg
Download PNG domain architecture visualizations for FASTA sequences (using FASTA header as the file name)
$ ./sdQuery.pl -u png -n '${FASTA_HEADER}.png' seqs.faa

// Creates PNG file: b0694.png
// Creates PNG file: b0695.png
// Creates PNG file: KdpC.png
Cross-reference a list of UniProt IDs to PDBs
$ ./sdQuery.pl -u xrefs -x pdb uni.txt

F5CGH2_9HIV1    1dp6,1dp8,1dp9,1drm,1lsv,1lsw,1lsx,1lt0
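
Output like the line above is straightforward to load into a dictionary. The Python sketch below assumes a single cross-referenced database, with the query identifier in the first tab-separated column and a comma-separated list in the second; parse_xrefs is a hypothetical helper, and the layout for multiple -x databases is not covered.

```python
def parse_xrefs(lines):
    """Map each query identifier to its list of cross-referenced IDs."""
    xrefs = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        # Assumed layout: <query><TAB><id>,<id>,...
        query, _, targets = line.partition("\t")
        xrefs[query] = [t for t in targets.strip().split(",") if t]
    return xrefs


sample = ["F5CGH2_9HIV1\t1dp6,1dp8,1dp9,1drm,1lsv,1lsw,1lsx,1lt0\n"]
pdb_ids = parse_xrefs(sample)
```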

SeqDepot

We have developed both Perl and Python modules for facilitating many tasks related to interfacing with the SeqDepot server. Additionally, they include a few subroutines for working with sequence data (e.g. parsing FASTA files).

Features

  • Find sequences using Aseq, MD5 digests, GI, UniProt, or PDB identifiers
  • Retrieve partial or entire records
  • Interconvert Aseq IDs and hexadecimal MD5 digests
  • Derive Aseq IDs and hexadecimal MD5 digests directly from sequences
  • Clean sequences: remove whitespace and replace invalid characters
  • Transform precomputed tool data (stored in arrays without field names) to array of hashes (with meaningful column names)
  • Validation
  • FASTA parser
  • Save PNG or SVG visualizations

Download

Perl
Python

These modules (along with the Python 2.x module) are also available on the download page.

Requirements

  • Internet connection :)
  • Perl v5.8 or higher
  • The following Perl modules (most are common to a normal Perl installation)
    • Carp
    • Digest::MD5
    • HTTP::Request::Common
    • JSON
    • LWP::UserAgent
    • MIME::Base64
  • The Python module has the following dependencies
    • json
    • hashlib (or md5 if using python 2.x)
    • base64
    • re
    • binascii
    • urllib (or urllib2 if using python 2.x)

Subroutines

aseqIdFromMD5Hex

Description:
Static method that converts an MD5 hexadecimal string into its Aseq ID equivalent.
Parameters:
MD5hex {string} hexadecimal MD5 digest
Returns:
Aseq ID {string}
Example:
use SeqDepot;
my $aseq_id = SeqDepot::aseqIdFromMD5Hex('ca0f00f07f0dfb8c751337fc596f986c');
print $aseq_id;     # "yg8A8H8N-4x1Ezf8WW-YbA"
import SeqDepot
aseq_id = SeqDepot.aseqIdFromMD5Hex('ca0f00f07f0dfb8c751337fc596f986c')
print(aseq_id)     # "yg8A8H8N-4x1Ezf8WW-YbA"
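
For readers curious about what such a conversion involves, here is a standalone Python sketch of both directions, following the Aseq ID derivation rules. The function names are hypothetical and the code is not taken from the SeqDepot module.

```python
import base64
import binascii


def aseq_id_from_md5_hex(md5_hex: str) -> str:
    """Hexadecimal MD5 digest -> 22-character Aseq ID."""
    digest = binascii.unhexlify(md5_hex)
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


def md5_hex_from_aseq_id(aseq_id: str) -> str:
    """Aseq ID -> hexadecimal MD5 digest."""
    # A 22-character Aseq ID needs two '=' characters to restore the padding.
    digest = base64.urlsafe_b64decode(aseq_id + "==")
    return binascii.hexlify(digest).decode("ascii")
```

The two functions are inverses of each other, so the hexadecimal digest in the example above round-trips through its Aseq ID unchanged.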

aseqIdFromSequence

Description:
Static method for computing the Aseq ID for a given sequence. It is recommended that all sequences are cleaned before calling this method.
Parameters:
sequence {string} ungapped, upper-case amino acid sequence
Returns:
Aseq ID {string}
Example:
use SeqDepot;

my $aseq_id = SeqDepot::aseqIdFromSequence('MTNVLIVEDEQAIRRFLRTALEGDGMRVFEAETLQRGLLEAATRKPDLIILDLGLPDGDGIEFIRDLRQWSAVPVIVLSARSEESDKIAALDAGADDYLSKPFGIGELQARLRVALRRHSATTAPDPLVKFSDVTVDLAARVIHRGEEEVHLTPIEFRLLAVLLNNAGKVLTQRQLLNQVWGPNAVEHSHYLRIYMGHLRQKLEQDPARPRHFITETGIGYRFML');
print $aseq_id;     # "yg8A8H8N-4x1Ezf8WW-YbA"
import SeqDepot

aseq_id = SeqDepot.aseqIdFromSequence('MTNVLIVEDEQAIRRFLRTALEGDGMRVFEAETLQRGLLEAATRKPDLIILDLGLPDGDGIEFIRDLRQWSAVPVIVLSARSEESDKIAALDAGADDYLSKPFGIGELQARLRVALRRHSATTAPDPLVKFSDVTVDLAARVIHRGEEEVHLTPIEFRLLAVLLNNAGKVLTQRQLLNQVWGPNAVEHSHYLRIYMGHLRQKLEQDPARPRHFITETGIGYRFML')
print(aseq_id)     # "yg8A8H8N-4x1Ezf8WW-YbA"

cleanSequence

Description:
Static method that removes all whitespace characters from a sequence and replaces all digits and non-word characters with an at sign (@) for easy identification of invalid symbols.
Parameters:
sequence {string}
Returns:
{string}
Example:
use SeqDepot;

my $dirtySequence = "M tn\nVLI";
my $cleanSequence = SeqDepot::cleanSequence($dirtySequence);
print $cleanSequence;     # "MTNVLI"

# Note: the 9 and - characters will be replaced with at signs (@)
my $sequenceWithInvalidChars = "MTNV 9 L - I";
$cleanSequence = SeqDepot::cleanSequence($sequenceWithInvalidChars);
print $cleanSequence;     # "MTNV@L@I"
import SeqDepot

dirtySequence = "M tn\nVLI"
cleanSequence = SeqDepot.cleanSequence(dirtySequence)
print(cleanSequence)     # "MTNVLI"

# Note: the 9 and - characters will be replaced with at signs (@)
sequenceWithInvalidChars = "MTNV 9 L - I"
cleanSequence = SeqDepot.cleanSequence(sequenceWithInvalidChars)
print(cleanSequence)     # "MTNV@L@I"
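
The behavior described above can be approximated with two regular expressions. The following standalone Python sketch is an illustration, not the module's source: clean_sequence is a hypothetical name, and the upper-casing step is inferred from the example output ("M tn\nVLI" becoming "MTNVLI").

```python
import re


def clean_sequence(sequence: str) -> str:
    """Strip whitespace, mark invalid symbols with '@', and upper-case."""
    without_space = re.sub(r"\s+", "", sequence)
    # \W matches non-word characters and \d matches digits.
    marked = re.sub(r"[\W\d]", "@", without_space)
    return marked.upper()
```

Note that an underscore counts as a word character in regular expressions, so this sketch leaves it untouched.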

find

Description:

Retrieves one or more records from SeqDepot. Unless otherwise specified (see parameters), all fields are returned by default.

Returns a mixed array of hashes or undefs, indicating whether the respective requested Aseq ID was found (undef meaning the requested Aseq ID was not found - not that some other error occurred).

Parameters:
  1. ids {string | number | array.<string>} one or more sequence identifiers (if multiple, must all be of the same type)
  2. params {hash} (optional) qualifies the find with the following:
    • type {string} identifier type; defaults to aseq_id, but use gi, uni, pdb, or md5_hex for GI, UniProt, PDB, or MD5 hexadecimal identifiers, respectively.
    • fields {string} comma-separated field string (see schema); secondary fields must be listed in parentheses, separated by pipe (|) symbols, immediately after the respective primary field name. For example:

      s,l                        Returns the sequence and length primary fields
      l,t,x                      Returns the length, all tool data, and all cross-references
      l,t(pfam26|smart)          Returns the length primary field, and pfam and smart secondary fields of t
      l,t(pfam26|smart),x(gi)    Same as the above but include any GI identifiers
    • labelToolData {boolean} defaults to false; if true converts any tool data (the t field) into an array of hashes with meaningful field names
Returns:

{undef | array.<hash | undef>}

On success, returns a mixed array of hashes or undefs. An undef value for the nth element indicates that no Aseq record was found for the nth identifier.

Returns undef if a network error occurs. Call lastError to get the error message.

Example:
use SeqDepot;                        

# Retrieve all data for a single sequence by its aseq_id
my $sd = new SeqDepot();
my $aseq_id = "naytI0dLM_rK2kaC1m3ZSQ";
my $aseqs = $sd->find($aseq_id);     # [{_id => "naytI0dLM_rK2kaC1m3ZSQ",
                                     #   l => 894,
                                     #   ... }
                                     # ]

# Retrieve the sequence length (l) for 2 GI identifiers and one invalid GI
my $gis = [300937843, 1346374, -2345324];
$aseqs = $sd->find($gis, {type => 'gi', fields => 'l'});   # reuses $aseqs declared above
if ($aseqs) {
    # [ {"l":225,"id":"yg8A8H8N-4x1Ezf8WW-YbA"},
    #   {"l":894,"id":"naytI0dLM_rK2kaC1m3ZSQ"},
    #   undef ]
}
else {
    print $sd->lastError();
}
import SeqDepot

# Retrieve all data for a single sequence by its aseq_id
sd = SeqDepot.new()
aseq_id = "naytI0dLM_rK2kaC1m3ZSQ"
aseqs = sd.find(aseq_id)     # [{'_id' : "naytI0dLM_rK2kaC1m3ZSQ",
                             #   'l' : 894,
                             #   ... }
                             # ]

# Retrieve the sequence length (l) for 2 GI identifiers and one invalid GI
gis = [300937843, 1346374, -2345324]
aseqs = sd.find(gis, {'type':'gi', 'fields':'l'})
if aseqs:
    # [ {"l":225,"id":"yg8A8H8N-4x1Ezf8WW-YbA"},
    #   {"l":894,"id":"naytI0dLM_rK2kaC1m3ZSQ"},
    #   None ]
    print(aseqs)
else:
    print(sd.lastError())
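
Because find returns one element per requested identifier, with undef (None in Python) marking identifiers that were not found, callers usually need to pair the results with the identifiers they asked for. The following is a minimal sketch of that bookkeeping, assuming the positional correspondence described above. pair_find_results is a hypothetical helper, not part of the SeqDepot module, and the sample data are copied from the GI example above.

```python
def pair_find_results(ids, results):
    """Split find() output into found records and missing identifiers.

    Hypothetical helper (not part of SeqDepot). Assumes results[i]
    corresponds to ids[i], and that results is None on a network error.
    """
    if results is None:
        raise RuntimeError("find failed; call lastError() for the message")
    found = {}
    missing = []
    for identifier, record in zip(ids, results):
        if record is None:
            missing.append(identifier)   # no Aseq record for this identifier
        else:
            found[identifier] = record
    return found, missing


# Sample data copied from the GI example above
gis = [300937843, 1346374, -2345324]
results = [
    {"l": 225, "id": "yg8A8H8N-4x1Ezf8WW-YbA"},
    {"l": 894, "id": "naytI0dLM_rK2kaC1m3ZSQ"},
    None,
]
found, missing = pair_find_results(gis, results)
print(found[1346374]["l"])  # 894
print(missing)              # [-2345324]
```

Keeping the "not found" case (a None element) separate from the "request failed" case (a None return value) mirrors the distinction the find documentation draws.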

findOne

Description:
Retrieves a single record from SeqDepot. Unless otherwise specified (see parameters), all fields are returned by default.
Parameters:
  1. ids {string | number} a single sequence identifier
  2. params {hash} (optional) see find parameters for details.
Returns:
{undef | hash}
Example:
use SeqDepot;

# Retrieve all data for a single sequence by its aseq_id; note that
# unlike find, a hash is returned rather than an array.
my $sd = new SeqDepot();
my $aseq_id = "naytI0dLM_rK2kaC1m3ZSQ";
my $aseq = $sd->findOne($aseq_id);  # {_id => "naytI0dLM_rK2kaC1m3ZSQ",
                                    #   l => 894,
                                    #   ... }
import SeqDepot

# Retrieve all data for a single sequence by its aseq_id; note that
# unlike find, a dictionary is returned rather than a list.
sd = SeqDepot.new()
aseq_id = "naytI0dLM_rK2kaC1m3ZSQ"
aseq = sd.findOne(aseq_id)  # {'_id' : "naytI0dLM_rK2kaC1m3ZSQ",
                            #   'l' : 894,
                            #   ... }

isToolDone

Description:
Returns true if the status string marks the tool identified by toolId as done. The status string corresponds to the aseqs._s field and records which predictive tools have been executed and whether each of them produced any results.
Parameters:
  1. toolId {string} tool identifier; list of valid tool ids
  2. status {string} status string
Returns:
{boolean}
Example:
use SeqDepot;

my $sd = new SeqDepot();
my $aseq_id = "naytI0dLM_rK2kaC1m3ZSQ";
my $aseq = $sd->findOne($aseq_id);  # {_id => "naytI0dLM_rK2kaC1m3ZSQ",
                                    #   l => 894,
                                    #   _s => "TTTdTdT-TdTTTdTTddT",
                                    #   ... }
$sd->isToolDone("pfam26", $aseq->{_s});    # 1 (true)
import SeqDepot

sd = SeqDepot.new()
aseq_id = "naytI0dLM_rK2kaC1m3ZSQ"
aseq = sd.findOne(aseq_id)  # {'_id' : "naytI0dLM_rK2kaC1m3ZSQ",
                            #  'l' : 894,
                            #  '_s' : "TTTdTdT-TdTTTdTTddT",
                            #   ... }
sd.isToolDone("pfam26", aseq['_s'])    # 1 (true)
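
A sketch of how such a status string might be interpreted. Everything specific here is an assumption made for illustration and is not confirmed by the description above: that the nth character belongs to the nth tool in toolNames order, that "-" means the tool has not been run, and that any other character means it has. The tool order below is hypothetical, and is_tool_done is not the module's implementation.

```python
def is_tool_done(tool_id, status, tool_order):
    """Return True if tool_id has been run according to status.

    Assumptions (illustrative only): the nth character of status belongs
    to the nth tool in tool_order, '-' means not yet run, and any other
    character means the tool has been run.
    """
    if not status or tool_id not in tool_order:
        return False
    position = tool_order.index(tool_id)
    if position >= len(status):
        return False
    return status[position] != "-"


# Hypothetical tool order and status string, for illustration only
tool_order = ["agfam1", "coils", "das"]
print(is_tool_done("agfam1", "Td-", tool_order))  # True
print(is_tool_done("das", "Td-", tool_order))     # False
```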

isValidAseqId

Description:
Static method that returns true if id is a validly formatted Aseq ID; false otherwise.
Parameters:
id {string}
Returns:
{boolean}
Example:
use SeqDepot;

print SeqDepot::isValidAseqId('yg8A8H8N-4x1Ezf8WW-YbA');    # 1 (true)
print SeqDepot::isValidAseqId('yg8A8H8N-4x1Ezf8WW-Yb');     # 0 (false)
print SeqDepot::isValidAseqId('yg8A8H8N-4x1Ezf8WW-YbAA');   # 0 (false)
print SeqDepot::isValidAseqId(undef);                       # 0 (false)
import SeqDepot

print(SeqDepot.isValidAseqId('yg8A8H8N-4x1Ezf8WW-YbA'))    # 1 (true)
print(SeqDepot.isValidAseqId('yg8A8H8N-4x1Ezf8WW-Yb'))     # 0 (false)
print(SeqDepot.isValidAseqId('yg8A8H8N-4x1Ezf8WW-YbAA'))   # 0 (false)
print(SeqDepot.isValidAseqId(None))                        # 0 (false)
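
The four examples above are consistent with an Aseq ID being exactly 22 characters drawn from the URL-safe Base64 alphabet (letters, digits, "-" and "_"). A minimal sketch of such a format check under that assumption follows; looks_like_aseq_id is a hypothetical name, and the module's own rules may be stricter.

```python
import re

# Assumed format: exactly 22 characters from the URL-safe Base64 alphabet
ASEQ_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{22}")


def looks_like_aseq_id(candidate):
    """Hypothetical format check; not the SeqDepot implementation."""
    return (isinstance(candidate, str)
            and ASEQ_ID_PATTERN.fullmatch(candidate) is not None)


print(looks_like_aseq_id("yg8A8H8N-4x1Ezf8WW-YbA"))   # True
print(looks_like_aseq_id("yg8A8H8N-4x1Ezf8WW-Yb"))    # False (21 characters)
print(looks_like_aseq_id("yg8A8H8N-4x1Ezf8WW-YbAA"))  # False (23 characters)
print(looks_like_aseq_id(None))                       # False
```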

isValidFieldString

Description:
Static method that returns true if fields is validly formatted; false otherwise.
Parameters:
fields {string} comma- and pipe-separated field string; see find parameters for details
Returns:
{boolean}
Example:
use SeqDepot;

print SeqDepot::isValidFieldString('l,s,_s');    # 1 (true)
print SeqDepot::isValidFieldString('');          # 0 (false)
# The following returns 1 (true)
print SeqDepot::isValidFieldString('t(pfam26|das|hamap),x(uni)');
print SeqDepot::isValidFieldString('x(my_db)');
import SeqDepot

print(SeqDepot.isValidFieldString('l,s,_s'))    # 1 (true)
print(SeqDepot.isValidFieldString(''))          # 0 (false)
# The following returns 1 (true)
print(SeqDepot.isValidFieldString('t(pfam26|das|hamap),x(uni)'))
print(SeqDepot.isValidFieldString('x(my_db)'))
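
The field-string grammar described under find (comma-separated primary fields, each optionally followed by pipe-separated secondary fields in parentheses) can be expressed as a regular expression. The sketch below assumes field names consist of letters, digits, and underscores, which is inferred from the examples rather than stated in the documentation. Both function names are hypothetical and not part of the SeqDepot module.

```python
import re

_NAME = r"\w+"                                    # assumed field-name characters
_FIELD = rf"{_NAME}(?:\({_NAME}(?:\|{_NAME})*\))?"  # primary(sec1|sec2)
FIELD_STRING_PATTERN = re.compile(rf"{_FIELD}(?:,{_FIELD})*")


def looks_like_field_string(fields):
    """Hypothetical syntax check; does not verify that the fields exist."""
    return (isinstance(fields, str)
            and FIELD_STRING_PATTERN.fullmatch(fields) is not None)


def build_field_string(fields):
    """Build a field string from {primary: [secondary, ...]}.

    An empty list requests the whole primary field.
    """
    parts = []
    for primary, secondary in fields.items():
        parts.append(f"{primary}({'|'.join(secondary)})" if secondary else primary)
    return ",".join(parts)


print(build_field_string({"l": [], "t": ["pfam26", "smart"], "x": ["gi"]}))
# l,t(pfam26|smart),x(gi)
print(looks_like_field_string("t(pfam26|das|hamap),x(uni)"))  # True
print(looks_like_field_string(""))                            # False
```

Like the x(my_db) example above, this check is purely syntactic: a well-formed string naming an unknown secondary field still passes.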

lastError

Description:
Returns the error message from the last find operation, or undef if no error occurred.
Parameters:
None
Returns:
{string | undef}
Example:
use SeqDepot;

my $sd = new SeqDepot();
my $aseqs = $sd->find(...);
if (!$aseqs) {
    # Uh oh, an error occurred, inform user.
    print $sd->lastError();
}
import SeqDepot

sd = SeqDepot.new()
aseqs = sd.find(...)
if not aseqs:
    # Uh oh, an error occurred, inform user.
    print(sd.lastError())

MD5HexFromAseqId

Description:
Static method that returns the equivalent MD5 hexadecimal representation of aseqId.
Parameters:
aseqId {string}
Returns:
{string}
Example:
use SeqDepot;

# Prints "ca0f00f07f0dfb8c751337fc596f986c"
print SeqDepot::MD5HexFromAseqId('yg8A8H8N-4x1Ezf8WW-YbA');
import SeqDepot

# Prints "ca0f00f07f0dfb8c751337fc596f986c"
print(SeqDepot.MD5HexFromAseqId('yg8A8H8N-4x1Ezf8WW-YbA'))
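
The example pair above (yg8A8H8N-4x1Ezf8WW-YbA and ca0f00f07f0dfb8c751337fc596f986c) is consistent with an Aseq ID being the URL-safe Base64 encoding of the 16-byte MD5 digest with the trailing "=" padding removed. A minimal sketch of the conversion under that assumption, using only the Python standard library; the function names are hypothetical and this is not the module's own code.

```python
import base64


def md5_hex_from_aseq_id(aseq_id):
    """Decode an Aseq ID to MD5 hex, assuming unpadded URL-safe Base64."""
    padding = "=" * (-len(aseq_id) % 4)   # restore the stripped padding
    return base64.urlsafe_b64decode(aseq_id + padding).hex()


def aseq_id_from_md5_hex(md5_hex):
    """Inverse conversion under the same assumption."""
    digest = bytes.fromhex(md5_hex)
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


print(md5_hex_from_aseq_id("yg8A8H8N-4x1Ezf8WW-YbA"))
# ca0f00f07f0dfb8c751337fc596f986c
print(aseq_id_from_md5_hex("ca0f00f07f0dfb8c751337fc596f986c"))
# yg8A8H8N-4x1Ezf8WW-YbA
```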

MD5HexFromSequence

Description:
Static method for computing the hexadecimal MD5 digest from sequence. It is recommended to clean the sequence before calling this method.
Parameters:
sequence {string}
Returns:
{string}
Example:
use SeqDepot;

my $sequence = "MTNVLIVEDEQAIRRFLRTALEGDGMRVFEAETLQRGLLEAATRKPDLIILDLGLPDGDGIEFIRDLRQWSAVPVIVLSARSEESDKIAALDAGADDYLSKPFGIGELQARLRVALRRHSATTAPDPLVKFSDVTVDLAARVIHRGEEEVHLTPIEFRLLAVLLNNAGKVLTQRQLLNQVWGPNAVEHSHYLRIYMGHLRQKLEQDPARPRHFITETGIGYRFML";
# Prints "ca0f00f07f0dfb8c751337fc596f986c"
print SeqDepot::MD5HexFromSequence($sequence);
import SeqDepot

sequence = "MTNVLIVEDEQAIRRFLRTALEGDGMRVFEAETLQRGLLEAATRKPDLIILDLGLPDGDGIEFIRDLRQWSAVPVIVLSARSEESDKIAALDAGADDYLSKPFGIGELQARLRVALRRHSATTAPDPLVKFSDVTVDLAARVIHRGEEEVHLTPIEFRLLAVLLNNAGKVLTQRQLLNQVWGPNAVEHSHYLRIYMGHLRQKLEQDPARPRHFITETGIGYRFML"
# Prints "ca0f00f07f0dfb8c751337fc596f986c"
print(SeqDepot.MD5HexFromSequence(sequence))
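
For readers without the module at hand, the digest itself can be computed with the standard library. The sketch below is illustrative only: the cleaning rule (remove all whitespace, convert to upper case) is an assumption, since the exact cleaning performed by SeqDepot is not specified here, and the Aseq ID derivation assumes the unpadded URL-safe Base64 encoding of the digest. All function names are hypothetical.

```python
import base64
import hashlib


def clean_sequence(sequence):
    """Assumed cleaning: remove all whitespace and upper-case the residues."""
    return "".join(sequence.split()).upper()


def md5_hex_from_sequence(sequence):
    """Hexadecimal MD5 digest of an already cleaned sequence."""
    return hashlib.md5(sequence.encode("ascii")).hexdigest()


def aseq_id_from_sequence(sequence):
    """Aseq ID for a cleaned sequence, assuming unpadded URL-safe Base64."""
    digest = hashlib.md5(sequence.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


raw = " mtnv\nliv "
cleaned = clean_sequence(raw)
print(cleaned)                         # MTNVLIV
print(md5_hex_from_sequence(cleaned))  # 32 hexadecimal characters
print(aseq_id_from_sequence(cleaned))  # 22 characters
```

Cleaning before hashing matters because the digest is computed over the exact characters supplied: a stray newline or lower-case residue yields a different identifier for what is biologically the same sequence.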

primeFastaBuffer

Description:
Sets the internal FASTA parsing buffer to fastaBuffer. This is useful when part of an input stream has already been read but not yet processed by the FASTA parser, for example after reading a line from STDIN to determine whether it contains FASTA data.
Parameters:
fastaBuffer {string}
Returns:
None

readFastaSequence

Description:

Reads a FASTA-formatted sequence from an open file handle and returns an array containing the header and the cleaned sequence. The header will not contain the > symbol. Returns undef if there are no more sequences to be read from the file handle.

Whitespace is trimmed from both ends of the header line.

Parameters:
fileHandle {open file handle}
Returns:
undef if end-of-file has been reached; otherwise, a 2-element array containing the header and cleaned sequence
Example:
use SeqDepot;

my $sd = new SeqDepot();

my $file = shift or die qq(Please provide a FASTA file\n);
open (IN, "< $file") or die qq(Unable to open file, $file: $!\n);
while (my $seq = $sd->readFastaSequence(*IN)) {
    # $seq is:
    # ["Header", "MTNVLIVEDEQAIR..."]
    print "Read one sequence\n";
    print "Header: $seq->[0]\n";
    print "Clean sequence: $seq->[1]\n";

    # ...
}
close (IN);
import SeqDepot

sd = SeqDepot.new()

# Keep prompting until a readable file is supplied
while True:
    file = input("Please provide a FASTA file: ")
    try:
        IN = open(file, 'r')
        break
    except OSError:
        print("Unable to open file, " + file)
seq = sd.readFastaSequence(IN)
while seq:
    # seq is:
    # ["Header", "MTNVLIVEDEQAIR..."]
    print("\nRead one sequence")
    print("Header: " + seq[0])
    print("Clean sequence: " + seq[1])

    # ...
    seq = sd.readFastaSequence(IN)
IN.close()
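
To illustrate why an internal buffer is involved at all, here is a sketch of a line-based FASTA reader with the behavior described above: the header is returned without the > symbol and trimmed, the sequence lines are joined, and None is returned at end-of-file. A reader only knows a record is finished once it has read the next header line, so that line has to be held until the next call. FastaReader and its methods are hypothetical and are not the SeqDepot implementation; "cleaning" here is assumed to mean removing whitespace only.

```python
import io


class FastaReader:
    """Sketch of buffered FASTA reading; not the SeqDepot implementation."""

    def __init__(self):
        self._buffer = ""  # header line read ahead during the previous call

    def prime(self, text):
        """Supply a line that was already consumed from the stream."""
        self._buffer = text

    def reset(self):
        """Discard any read-ahead line, e.g. before switching handles."""
        self._buffer = ""

    def read_sequence(self, handle):
        # Start from the buffered line, then skip ahead to the next header
        line = self._buffer or handle.readline()
        self._buffer = ""
        while line and not line.startswith(">"):
            line = handle.readline()
        if not line:
            return None  # end of file
        header = line[1:].strip()
        residues = []
        line = handle.readline()
        while line and not line.startswith(">"):
            residues.append("".join(line.split()))  # drop all whitespace
            line = handle.readline()
        self._buffer = line  # next header, or "" at end of file
        return [header, "".join(residues)]


handle = io.StringIO(">seq1 demo \nMTNV\nLIVE\n>seq2\nACDE\n")
reader = FastaReader()
print(reader.read_sequence(handle))  # ['seq1 demo', 'MTNVLIVE']
print(reader.read_sequence(handle))  # ['seq2', 'ACDE']
print(reader.read_sequence(handle))  # None
```

In this design, a header left in the buffer from a partially read handle would be returned as the first record of the next handle, which is the situation a buffer reset guards against.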

resetFastaBuffer

Description:

Clears the internal buffer used to read FASTA sequences. Call this method before readFastaSequence if all of the following are true:

  1. you are switching to a different filehandle,
  2. the filehandle has been partially read from, and
  3. the filehandle has not been completely read through to the end.
Parameters:
None
Returns:
None

saveImage

Description:
Saves an image of the Aseq record for id.
Parameters:
  1. id {string | number} Aseq ID | GI | UniProt ID | PDB ID | MD5 hex; if other than Aseq ID, must specify the type in the params argument.
  2. fileName {string} (optional) name of file to save image to; defaults to id with the appropriate file extension
  3. params {hash} (optional) qualifies the query with the following:
    • type {string} identifier type; see find parameters for details
    • fields {string} comma-separated field string; see find parameters for details
    • format {string} type of image to save; only png and svg are supported; defaults to png
Returns:
None
Example:
use SeqDepot;

my $sd = new SeqDepot();

my $gi = 3355692;

# Save domain architecture image to the current directory
# as "3355692.png"
$sd->saveImage($gi, undef, {type => 'gi'});

# Save svg with custom filename: "cher.svg"
$sd->saveImage('CHER_ECOLI', 'cher.svg', {type => 'uni', format => 'svg'});
import SeqDepot

sd = SeqDepot.new()

gi = 3355692

# Save domain architecture image to the current directory
# as "3355692.png"
sd.saveImage(gi, None, {'type' : 'gi'})

# Save svg with custom filename: "cher.svg"
sd.saveImage('CHER_ECOLI', 'cher.svg', {'type' : 'uni', 'format' : 'svg'})
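
The fileName default described above (the id plus the extension for the chosen format) can be sketched as follows. default_image_name is a hypothetical helper written for illustration; it is not part of the SeqDepot module, and raising an error for unsupported formats is an assumption about reasonable behavior rather than documented behavior.

```python
def default_image_name(identifier, image_format="png"):
    """Hypothetical helper mirroring the documented fileName default."""
    if image_format not in ("png", "svg"):
        # Assumed handling; the module's actual response is not documented
        raise ValueError("only png and svg are supported")
    return f"{identifier}.{image_format}"


print(default_image_name(3355692))              # 3355692.png
print(default_image_name("CHER_ECOLI", "svg"))  # CHER_ECOLI.svg
```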

toolFields

Description:
Returns the field names associated with toolId, or null if an error occurs.
Parameters:
toolId {string} tool identifier; list of valid tool ids
Returns:
{array.<string> | null}
Example:
use SeqDepot;

my $sd = new SeqDepot();

my $fieldNames = $sd->toolFields('das');
if ($fieldNames) {
    foreach my $fieldName (@$fieldNames) {
        print "Field: $fieldName\n";
        # Prints
        # start
        # stop
        # peak
        # peak_score
        # evalue
    }
}
import SeqDepot

sd = SeqDepot.new()

fieldNames = sd.toolFields('das')
if fieldNames:
    for fieldName in fieldNames:
        print("Field: " + fieldName)
        # Prints
        # start
        # stop
        # peak
        # peak_score
        # evalue
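
Field names such as these are what give positional tool data meaning, which is the idea behind the labelToolData option of find. A minimal sketch of that labeling step, assuming each raw tool result is a list of values in the same order as the field names; the raw row below is invented for illustration, and label_tool_rows is a hypothetical helper rather than the module's implementation.

```python
def label_tool_rows(field_names, rows):
    """Turn positional tool rows into dicts keyed by field name.

    Assumes each row lists its values in the same order as field_names.
    """
    return [dict(zip(field_names, row)) for row in rows]


# Field names as printed by the toolFields('das') example above
das_fields = ["start", "stop", "peak", "peak_score", "evalue"]
# Hypothetical raw row, invented for illustration
das_rows = [[12, 30, 21, 4.5, 1.2e-05]]

labeled = label_tool_rows(das_fields, das_rows)
print(labeled[0]["start"], labeled[0]["stop"])  # 12 30
print(labeled[0]["peak"])                       # 21
```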

toolNames

Description:
Returns an ordered array of all valid tool identifiers on success; or null if an error occurred.
Parameters:
None
Returns:
{array.<string> | null}
Example:
use SeqDepot;

my $sd = new SeqDepot();

my $toolIds = $sd->toolNames();
if ($toolIds) {
    foreach my $toolId (@$toolIds) {
        print "Tool: $toolId\n";
        # Prints
        # agfam1
        # coils
        # ...
    }
}
import SeqDepot

sd = SeqDepot.new()

toolIds = sd.toolNames()
if toolIds:
    for toolId in toolIds:
        print("Tool: " + toolId)
        # Prints
        # agfam1
        # coils
        # ...

tools

Description:
Returns a hash of the tools available in SeqDepot and their associated fields on success; or null otherwise. Each hash key is a tool identifier, and the corresponding value is an array of that tool's field names.
Parameters:
None
Returns:
{hash | null}
Example:
use SeqDepot;

my $sd = new SeqDepot();

my $tools = $sd->tools();
if ($tools) {
    foreach my $toolId (keys %$tools) {
        my @fields = @{$tools->{$toolId}};
        print "Id: $toolId -> ", join(', ', @fields) , "\n";
        # Prints
        # agfam1 -> name, start, stop, extent, hmm_start, hmm_stop, hmm_extent, score, evalue
        # coils -> start, stop
        # ...
    }
}
import SeqDepot

sd = SeqDepot.new()

tools = sd.tools()
if tools:
    for toolId in tools.keys():
        fields = tools[toolId]
        print("Id: "+ toolId +" -> " + ', '.join(fields))
        # Prints
        # agfam1 -> name, start, stop, extent, hmm_start, hmm_stop, hmm_extent, score, evalue
        # coils -> start, stop
        # ...

Python

Davi Ortega has kindly ported the Perl module described here to the Python programming language. This module adheres to the Perl interface documented on this page (excepting syntax constructs and differences specific to each language).

You may retrieve the Python module from the download page.

Citation

The SeqDepot database has been published in the Bioinformatics journal. Please cite the following manuscript if you use SeqDepot in your research:

SeqDepot: streamlined database of biological sequences and precomputed features.
Ulrich, L.E. and Zhulin, I.B. Bioinformatics (2014).

Acknowledgments

We thank the following individuals / organizations: