Introduction
SeqDepot was born out of a need to easily and rapidly access precomputed data for protein sequences. Moreover, it is computationally wasteful to repeatedly analyze identical sequences that yield the same results. Many useful resources with similar goals already exist, so why build another one?
- Inability to add new sequences: we need to incorporate novel sequences on demand to support downstream projects (processing new genomes into MiST)
- Limited precomputed data: again, we have various projects that require additional predictive data not supported by or available to third party resources
- Inadequate query API: poorly implemented, external dependencies, slow, unable to query large numbers of sequences, and importantly, not easily consumable (i.e. requires significant programming to utilize)
Standing on the shoulders of giants
Rather than starting from scratch, SeqDepot builds on the value already provided by great projects like SIMAP, which is the source for a large portion of the core SeqDepot database. A flexible computational pipeline, in conjunction with access to a high-performance compute cluster, enables us to seamlessly add datasets derived with custom tools. A well-documented and powerful RESTful API gives the entire scientific community complete access and returns easily consumable JSON responses.
Infrastructure
At its core, the SeqDepot database is quite simple. A central MongoDB database instance stores protein sequences, a variety of precomputed feature data, and various cross-references to external databases. Sequences and annotations are merged from the SIMAP project, UniProt, the NCBI non-redundant database, and the PDB database. All sequences are uniquely identified by Aseq IDs. A few more features are then predicted on a local compute cluster. All information is exposed via a RESTful API.
Update process
- Download all features and sequences from the SIMAP FTP site
- Download the latest PDB database
- Download the latest NR database
- Extract UniProt identifiers from the SIMAP sequences file
- All sequences, features, and cross-references are then merged into a single master JSON file
- The aggregate JSON data file is then merged into the existing SeqDepot database
- All novel sequences are then further analyzed on the Newton high-performance compute cluster:
  - AGfam domains
  - Pfam domains
  - Extra-Cytoplasmic Function domains
  - Transmembrane regions (DAS TM-Filter)
  - Coiled-coils (coils)
  - Low-complexity regions (seg)
System details
Hardware
- (4) Intel 64-bit X5647 quad core processors
- 32 GB RAM
- 1 TB SSD hard drive space
Software
Intrinsic identifiers and the Aseq ID
Sequence identification is vital to any project involving DNA or protein sequences. The large number of available sequence databases inevitably results in a huge number of proprietary identifiers. Mapping sequences from one database to another involves extensive cross-referencing. The problem is so widespread that various services such as the Protein Identifier Cross-Reference (PICR) have been developed to cross-reference more than 100 distinct databases.
Ideally, each sequence could be identified using only its sequence characters; in other words, by an intrinsic identifier. Several algorithms from the cryptographic community do just this with varying degrees of success. For example, digesting the sequence with the MD5 hashing algorithm results in a small character string that uniquely identifies the sequence: any change, even to a single character, will produce a different digest. The only weakness of such an approach is "collisions", that is, two distinct sequences that produce the same digest. In practice this risk is negligible because a collision is very unlikely to occur, and no collisions have yet been found in any currently known sequence database. Stronger algorithms such as SHA-1 or SHA-2 are less likely to collide; however, they also produce longer character strings, which demand more memory when populating a database engine.
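To make the size trade-off concrete, the following Python sketch compares the encoded digest lengths of MD5, SHA-1 and SHA-256, and gives a rough birthday-bound estimate of the chance of an accidental MD5 collision. The function names, the sample input and the figure of one billion sequences are illustrative assumptions, and the estimate treats digests as uniformly random (it says nothing about deliberately constructed collisions).

```python
import base64
import hashlib


def encoded_lengths(algorithm: str, data: bytes = b"MTNVLIVEDEQ") -> tuple:
    """Return (hex length, unpadded Base64 length) of a digest."""
    digest = hashlib.new(algorithm, data).digest()
    hex_len = len(digest.hex())
    b64_len = len(base64.b64encode(digest).rstrip(b"="))
    return hex_len, b64_len


def collision_probability(n: int, bits: int = 128) -> float:
    """Birthday-bound approximation: n(n-1)/2 pairs over 2**bits digests."""
    return n * (n - 1) / 2 / 2 ** bits


for name in ("md5", "sha1", "sha256"):
    print(name, encoded_lengths(name))
# md5 (32, 22)
# sha1 (40, 27)
# sha256 (64, 43)

# Hypothetical database of one billion distinct sequences.
print(collision_probability(10 ** 9))
```

Under these assumptions the MD5 digest needs 22 Base64 characters versus 43 for SHA-256, while the estimated collision probability stays far below any practical concern.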
Aseq ID
Each sequence in the SeqDepot database is uniquely identified by an Aseq ID, which is simply derived as follows (see table below for examples):
- Remove all non-sequence characters from the sequence
- Upper-case the sequence
- Generate an MD5 digest
- Encode the result in Base64 (smaller than its hexadecimal equivalent and human-readable)
- Remove any padding characters (usually equal signs)
- Convert all forward slashes and plus signs to URL "friendly" underscores and dashes, respectively
| Sequence | Base64 encoded MD5 | Aseq ID |
|---|---|---|
| >Tar MINRIRVVTLLVMVLGVFALLQLISGSLF FSSLHHSQKSFVVSNQLREQQGELTSTWD LMLQASTALNKAGTLTALSYPADDIKTLM ... | /wXylOa/eoFtpjBR9hTF2A== | _wXylOa_eoFtpjBR9hTF2A |
| >CheW MTGMTNVTKLASEPSGQEFLVFTLGDEEY GIDILKVQEIRGYDQVTRIANTPAFIKGV TNLRGVIVPIVDLRIKFSQVDVDYNDNTV VIVLNLGQRVVGIVVDGVSDVLSLTAEQI RPAPEFAVTLSTEYLTGLGALGDRMLILV NIEKLLNSEEMALLDSAASEVA | /VZVW/e+iEqzExAuqrwEhQ== | _VZVW_e-iEqzExAuqrwEhQ |
| >CheA MSMDISDFYQTFFDEADELLADMEQHLLV LQPEAPDAEQLNAIFRAAHSIKGGAGTFG FSVLQETTHLMENLLDEARRGEMQLNTDI INLFLETKDIMQEQLDAYKQSQEPDAASF DYIC... | fiUs+3vh34LxGVAdbheipg== | fiUs-3vh34LxGVAdbheipg |
The result is an Aseq ID: a source-agnostic, 22-character string derived solely from the sequence characters that uniquely identifies a specific amino acid sequence (note: this approach may easily be applied to DNA / RNA sequences as well). Aseq IDs circumvent the need to maintain various cross-references to external databases and simplify sequence identification. Moreover, they are small enough to be readily indexed by database systems without consuming large amounts of RAM.
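The derivation steps listed above can be sketched in a few lines of Python using only the standard library. This is an illustration rather than the SeqDepot module itself: the function names are hypothetical, and it assumes that "non-sequence characters" means anything other than ASCII letters.

```python
import base64
import hashlib
import re


def aseq_id_from_base64(b64_md5: str) -> str:
    """Last two steps: drop '=' padding, then map '/' to '_' and '+' to '-'."""
    return b64_md5.rstrip("=").replace("/", "_").replace("+", "-")


def aseq_id_from_sequence(sequence: str) -> str:
    """All steps, starting from a raw sequence string."""
    # Assumption: every character that is not an ASCII letter is removed.
    cleaned = re.sub(r"[^A-Za-z]", "", sequence).upper()
    digest = hashlib.md5(cleaned.encode("ascii")).digest()
    return aseq_id_from_base64(base64.b64encode(digest).decode("ascii"))


# Base64 values taken from the table above.
print(aseq_id_from_base64("/wXylOa/eoFtpjBR9hTF2A=="))  # _wXylOa_eoFtpjBR9hTF2A
print(aseq_id_from_base64("fiUs+3vh34LxGVAdbheipg=="))  # fiUs-3vh34LxGVAdbheipg

# Whitespace and letter case do not affect the identifier.
print(aseq_id_from_sequence("mtnv liv") == aseq_id_from_sequence("MTNVLIV"))  # True
```

Because a 16-byte MD5 digest always encodes to 24 Base64 characters with two padding characters, the resulting identifier is always 22 characters long.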
Local installation
The complete MongoDB data files comprising SeqDepot may be downloaded and used with a local MongoDB instance. Simply follow the steps below:
- Requirements:
  - At least 100 GB free hard drive space (preferably 150 GB or more)
  - As much RAM as possible
- Follow these instructions to install MongoDB
- Download the SeqDepot Mongo Database
- Decompress the seqdepot tarball:
$ tar xvf seqdepot.latest.tar.gz
- Start the mongodb server:
$ cd seqdepot
$ mongod --dbpath .
- Begin interacting with the database. For example, open a terminal, and try connecting with the mongo client:
$ mongo seqdepot
sdQuery.pl
sdQuery.pl is a command-line program written in Perl for easily retrieving data from the SeqDepot database. sdQuery.pl reads one or more input files of identifiers or FASTA sequences, queries SeqDepot for matches, and outputs the results. It may be easily used for one-off tasks or integrated into a computational pipeline.
Supported input data
- FASTA
- Aseq ID
- GI
- PDB ID
- UniProt ID
- MD5 hexadecimal digests
Output formats
- JSON
- PNG or SVG images (saved to separate files)
- Tab-delimited (only applies to cross-references)
Installation
- Download sdQuery.pl and SeqDepot.pm to the same directory
- Run sdQuery.pl:
$ ./sdQuery.pl
Usage
Simply run sdQuery.pl without any arguments, or with the -h flag, to view the usage:
$ ./sdQuery.pl
Usage: sdQuery.pl [options] <input file>[ <input file> ...]

  To process via STDIN, provide - in place of all <input file>.

  Available options:
  ------------------
    -h, --help                          : This help page.
    -t, --type = <string>               : Type of input data. Acceptable values are fasta,
                                          aseq_id, gi, uni (UniProt ID), pdb, or md5_hex.
                                          Will guess if not explicitly specified. If not
                                          fasta, all ids must exist on separate lines.
    -f, --fields = <string>             : List of fields to pull down from SeqDepot. By
                                          default, all available fields are requested.
    -a, --array-to-hashes               : Convert all pre-computed array data to an array
                                          of hashes.
    -o, --out = <file>                  : Redirect all output to <file> instead of the
                                          console.
    -u, --outtype = <string>            : Type of data to output. Acceptable values are
                                          json, json_per_line, fasta, png, svg, or xrefs
                                          (TSV). Defaults to json. If xrefs is used, then
                                          the xrefs option (-x, --xrefs) must be specified.
    -d, --image-dir = <path>            : Directory to save images to. Applies only if the
                                          outType is png or svg. Defaults to the current
                                          directory.
    -n, --image-file-pattern = <string> : Filename pattern to use when saving images to
                                          disk. The special variable ${ID} will be replaced
                                          with the query identifier. In the case of FASTA
                                          input, this is its Aseq ID. Alternatively,
                                          ${FASTA_HEADER} may be used with fasta input to
                                          use the fasta header as the base file name.
                                          Defaults to ${ID}.
    -x, --xrefs = <string>              : One or more comma-separated database names to
                                          cross-reference. Acceptable values are gi, uni,
                                          and pdb. Only relevant if outtype is set to xrefs.
    -p, --pretty-json                   : If outtype is json, then pretty print the results.
Additional option details
- -t, --type = <string>
If set to fasta, sdQuery generates the Aseq ID corresponding to each sequence and uses this to find matching records. No header data is used for identification purposes.
Otherwise, if the input consists of identifiers, each must be placed on its own line, starting at the beginning of the line.
- -f, --fields = <string>
Comma-separated field string (see schema); secondary fields must be listed in parentheses, separated with pipe (|) symbols, and immediately suffix the respective primary field name. For example,
s,l                       Returns the sequence and length primary fields
l,t,x                     Returns the length, all tool data, and all cross-references
l,t(pfam26|smart)         Returns the length primary field, and pfam and smart secondary fields of t
l,t(pfam26|smart),x(gi)   Same as the above but include any GI identifiers
- -a, --array-to-hashes
This option transforms any requested precomputed data from an array of arrays into an array of hashes when the output type is either json or json_per_line.
To reduce the required storage space, all precomputed data is stored as an array of arrays (typically, an analytical tool may produce multiple results per sequence, and each result may have multiple attributes). For example, the DAS transmembrane prediction tool may predict several transmembrane regions with each described by five fields (start, stop, peak, peak_score, and evalue). Here is a partial record with DAS results without using the -a, --array-to-hashes option:
{ "l": 894, "id": "naytI0dLM_rK2kaC1m3ZSQ", "t": { "das": [ [ 403, 423, 411, 4.116, 0.0006308 ], [ 425, 445, 434, 5.243, 1.185e-5 ], ...
With the -a, --array-to-hashes option, the above would look like:
{ "l": 894, "id": "naytI0dLM_rK2kaC1m3ZSQ", "t": { "das": [ {"start": 403, "stop": 423, "peak": 411, "peak_score": 4.116, "evalue": 0.0006308 }, {"start": 425, "stop": 445, "peak": 434, "peak_score": 5.243, "evalue": 1.185e-5 }, ...
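The transformation performed by this option can be sketched as follows. The five DAS field names come from the description above; the function name `rows_to_hashes` and the `DAS_FIELDS` constant are hypothetical and only serve the illustration.

```python
# Field order for DAS results, as listed in the text above.
DAS_FIELDS = ("start", "stop", "peak", "peak_score", "evalue")


def rows_to_hashes(rows, field_names):
    """Pair each positional value with its field name, one dict per result row."""
    return [dict(zip(field_names, row)) for row in rows]


das_rows = [
    [403, 423, 411, 4.116, 0.0006308],
    [425, 445, 434, 5.243, 1.185e-5],
]

for hit in rows_to_hashes(das_rows, DAS_FIELDS):
    print(hit["start"], hit["stop"], hit["evalue"])
```

The array-of-arrays form avoids repeating the field names for every result, which is why it is the more compact storage format; the labelled form is easier to consume in client code.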
- -u, --outtype = <string>
If fasta, then each header line will be terminated with a numeric HTTP status code (200 if sequence exists in SeqDepot, 404 if not found, or 500 if a server error occurred). If the status is 200, a JSON string containing the requested data will also be appended.
If json_per_line, then each result (for each query) will be returned as an independent JSON-encoded object on its own line. This is useful when dealing with many sequences because parsers can begin processing after each line has been read. In contrast, json returns a single JSON-encoded object, so the entire response must be received before any processing may begin.
If png or svg, then an image file will be created in the current directory (or value specified by the -d, --image-dir) for each matching sequence in SeqDepot.
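As an illustration of why json_per_line suits large result sets, the sketch below consumes such output one line at a time. It assumes each line holds an object with the same query, data and code keys seen in the regular json output; the function name `iter_results` and the in-memory sample stream are hypothetical.

```python
import io
import json


def iter_results(stream):
    """Yield one decoded result per non-empty line, without loading the whole stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Stand-in for a file or pipe produced with "-u json_per_line" (assumed layout).
sample = io.StringIO(
    '{"query":"300937843","data":{"l":225,"id":"yg8A8H8N-4x1Ezf8WW-YbA"},"code":200}\n'
    '{"query":"1346374","data":{"l":894,"id":"naytI0dLM_rK2kaC1m3ZSQ"},"code":200}\n'
)

for result in iter_results(sample):
    if result["code"] == 200:
        print(result["query"], result["data"]["l"])
```

Because the generator yields as soon as a line is decoded, downstream processing can start before sdQuery.pl has finished writing.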
sdQuery Examples
The example commands reference the following input files:
- gis.txt
$ cat gis.txt
300937843
1346374
- uni.txt
$ cat uni.txt
F5CGH2_9HIV1
- seqs.faa
$ cat seqs.faa
>accession:NP_415222.1|locus:b0694|genome:Escherichia coli str. K-12 substr. MG1655
MTNVLIVEDEQAIRRFLRTALEGDGMRVFEAETLQRGLLEAATRKPDLIILDLGLPDGDG
IEFIRDLRQWSAVPVIVLSARSEESDKIAALDAGADDYLSKPFGIGELQARLRVALRRHS
ATTAPDPLVKFSDVTVDLAARVIHRGEEEVHLTPIEFRLLAVLLNNAGKVLTQRQLLNQV
WGPNAVEHSHYLRIYMGHLRQKLEQDPARPRHFITETGIGYRFML
>b0695 [Escherichia coli str. K-12 substr. MG1655]
MNNEPLRPDPDRLLEQTAAPHRGKLKVFFGACAGVGKTWAMLAEAQRLRAQGLDIVVGVV
ETHGRKDTAAMLEGLAVLPLKRQAYRGRHISEFDLDAALARRPALILMDELAHSNAPGSR
HPKRWQDIEELLEAGIDVFTTVNVQHLESLNDVVSGVTGIQVRETVPDPFFDAADDVVLV
DLPPDDLRQRLKEGKVYIAGQAERAIEHFFRKGNLIALRELALRRTADRVDEQMRAWRGH
PGEEKVWHTRDAILLCIGHNTGSEKLVRAAARLASRLGSVWHAVYVETPALHRLPEKKRR
AILSALRLAQELGAETATLSDPAEEKAVVRYAREHNLGKIILGRPASRRWWRRETFADRL
ARIAPDLDQVLVALDEPPARTINNAPDNRSFKDKWRVQIQGCVVAAALCAVITLIAMQWL
MAFDAANLVMLYLLGVVVVALFYGRWPSVVATVINVVSFDLFFIAPRGTLAVSDVQYLLT
FAVMLTVGLVIGNLTAGVRYQARVARYREQRTRHLYEMSKALAVGRSPQDIAATSEQFIA
STFHARSQVLLPDDNGKLQPLTHPQGMTPWDDAIAQWSFDKGLPAGAGTDTLPGVPYQIL
PLKSGEKTYGLVVVEPGNLRQLMIPEQQRLLETFTLLVANALERLTLTASEEQARMASER
EQIRNALLAALSHDLRTPLTVLFGQAEILTLDLASEGSPHARQASEIRQHVLNTTRLVNN
LLDMARIQSGGFNLKKEWLTLEEVVGSALQMLEPGLSSPINLSLPEPLTLIHVDGPLFER
VLINLLENAVKYAGAQAEIGIDAHVEGENLQLDVWDNGPGLPPGQEQTIFDKFARGNKES
AVPGVGLGLAICRAIVDVHGGTITAFNRPEGGACFRVTLPQQTAPELEEFHEDM
>KdpC
MSGLRPALSTFIFLLLITGGVYPLLTTVLGQWWFPWQANGSLIREGDTVRGSALIGQNFT
GNGYFHGRPSATAEMPYNPQASGGSNLAVSNPELDKLIAARVAALRAANPDASASVPVEL
VTASASGLDNNITPQAAAWQIPRVAKARNLSVEQLTQLIAKYSQQPLVKYIGQPVVNIVE
LNLALDKLDE
- seqs-named.faa
$ cat seqs-named.faa
>b0695
MNNEPLRPDPDRLLEQTAAPHRGKLKVFFGACAGVGKTWAMLAEAQRLRAQGLDIVVGVV
ETHGRKDTAAMLEGLAVLPLKRQAYRGRHISEFDLDAALARRPALILMDELAHSNAPGSR
HPKRWQDIEELLEAGIDVFTTVNVQHLESLNDVVSGVTGIQVRETVPDPFFDAADDVVLV
DLPPDDLRQRLKEGKVYIAGQAERAIEHFFRKGNLIALRELALRRTADRVDEQMRAWRGH
PGEEKVWHTRDAILLCIGHNTGSEKLVRAAARLASRLGSVWHAVYVETPALHRLPEKKRR
AILSALRLAQELGAETATLSDPAEEKAVVRYAREHNLGKIILGRPASRRWWRRETFADRL
ARIAPDLDQVLVALDEPPARTINNAPDNRSFKDKWRVQIQGCVVAAALCAVITLIAMQWL
MAFDAANLVMLYLLGVVVVALFYGRWPSVVATVINVVSFDLFFIAPRGTLAVSDVQYLLT
FAVMLTVGLVIGNLTAGVRYQARVARYREQRTRHLYEMSKALAVGRSPQDIAATSEQFIA
STFHARSQVLLPDDNGKLQPLTHPQGMTPWDDAIAQWSFDKGLPAGAGTDTLPGVPYQIL
PLKSGEKTYGLVVVEPGNLRQLMIPEQQRLLETFTLLVANALERLTLTASEEQARMASER
EQIRNALLAALSHDLRTPLTVLFGQAEILTLDLASEGSPHARQASEIRQHVLNTTRLVNN
LLDMARIQSGGFNLKKEWLTLEEVVGSALQMLEPGLSSPINLSLPEPLTLIHVDGPLFER
VLINLLENAVKYAGAQAEIGIDAHVEGENLQLDVWDNGPGLPPGQEQTIFDKFARGNKES
AVPGVGLGLAICRAIVDVHGGTITAFNRPEGGACFRVTLPQQTAPELEEFHEDM
>KdpC
MSGLRPALSTFIFLLLITGGVYPLLTTVLGQWWFPWQANGSLIREGDTVRGSALIGQNFT
GNGYFHGRPSATAEMPYNPQASGGSNLAVSNPELDKLIAARVAALRAANPDASASVPVEL
VTASASGLDNNITPQAAAWQIPRVAKARNLSVEQLTQLIAKYSQQPLVKYIGQPVVNIVE
LNLALDKLDE
Fetch complete JSON record for GI identifiers
$ ./sdQuery.pl gis.txt
[{"query":"300937843","data":{"l":225,"_s":"TdddTdT-TddTTdTTddd","id":"yg8A8H8N-4x1Ezf8WW-YbA","x":{"gi":[16128670,170080361,170682921,188495689,218699050,238899960,300937843,300951198,300959271,301028821,301645940,312970765,331641191,386279706,386596462,387611184,387620427,388476786,404374022,415776911,417128818,417263781,417274153,417275078,417289965,417611706,417617084,417633157,417946768,417978416,418301546,418959019,419141211,419146606,419152564,419158008,419162933,419813292,422765222,422791470,422816668,422827886,423701438,425114035,425118795,425271370,425282045,432415614,432562567,432579346,432601223,432626240,432635967,432659921,432679116,432684496,432690585,432703233,432736199,432880151,432953818,433046822,442595712,2507374,1786911,85674742,169888196,170520639,188490888,218369036,238860865,260450151,299878183,300314143,300449540,300457127,301075799,309700920,310337414,315135350,315616391,323938337,323972049,331037989,339413644,342361480,344191917,345366191,345380958,345390827,359331399,371616312,377999426,378001534,378003302,378013164,378016324,384378190,385153832,385540141,385712792,386123258,386143774,386222669,386232581,386241731,386256003,404292509,408198433,408205813,408572529,408573073,430943990,431099800,431109048,431143435,431165036,431174249,431203284,431224514,431224622,431230497,431246723,431286103,431413775,431470314,431571450,441604263],"uni":["B1LLD8_ECOSM","B1X6M5_ECODH","B2N7S1_ECOLX","B7NMP7_ECO7I","C4ZWH0_ECOBW","C9R0S5_ECOD1","D8A890_ECOLX","D8AUM7_ECOLX","D8BAH1_ECOLX","D8C3V7_ECOLX","E1HN88_ECOLX","E2WSA7_ECOLX","E3PH87_ECOH1","E6B6M5_ECOLX","E9WC57_ECOLX","E9YE81_ECOLX","F4SKQ2_ECOLX","F9R4B0_ECOLX","G0FEX8_ECOLX","G2ARB8_ECOLX","G2B6K4_ECOLX","G2CGW5_ECOLX","G2F843_ECOLX","H0QDN9_ECOLI","H1DN11_ECOLX","H4UG14_ECOLX","H4UWV4_ECOLX","H4VCZ8_ECOLX","H4VTC8_ECOLX","H4W7A9_ECOLX","I0ZPA7_ECOLX","I2HYP8_ECOLX","I2PSR2_ECOLX","I2R5Q3_9ESCH","I2RQI4_ECOLX","I2X7W8_ECOLX","I2Y0L1_ECOLX","I2YRK8_ECOLX","I2ZXA6_ECOLX","I4JCK9_ECOLX","KDPE_ECOLI"
]},"s":"MTNVLIVEDEQAIRRFLRTALEGDGMRVFEAETLQRGLLEAATRKPDLIILDLGLPDGDGIEFIRDLRQWSAVPVIVLSARSEESDKIAALDAGADDYLSKPFGIGELQARLRVALRRHSATTAPDPLVKFSDVTVDLAARVIHRGEEEVHLTPIEFRLLAVLLNNAGKVLTQRQLLNQVWGPNAVEHSHYLRIYMGHLRQKLEQDPARPRHFITETGIGYRFML","t":{"gene3d":[["3.40.50.2300","",1,122,2.2e-36],["1.10.10.10","winged helix repressor DNA binding domain",126,223,8.1e-28]],"superfam":[["SSF52172","CheY-like",1,189,7.8e-41]],"segs":[[46,61]],"pfam26":[["Response_reg",4,112,"..",0.008,1,111,"[.",4,113,"..",101.91,6.865e-32,1.248e-29,0.982],["Trans_reg_C",148,223,"..",0.026,2,77,".]",146,223,"..",77.617,6.415e-25,3.207e-22,0.973]],"agfam1":[["RR",3,118,"..",1,122,"[]",126.207,2.648e-37]],"smart":[["SM00448","REC",2,112,2.4e-40],["SM00862","Trans_reg_C",147,223,2.1e-22]],"proscan":[["PS50110","RESPONSE_REGULATORY",3,116,40.23]],"panther":[["PTHR26402",1,225,3.8e-86],["PTHR26402:SF259",1,225,3.8e-86]]}},"code":200},{"query":"1346374","data":{"l":894,"_s":"TTTdTdT-TdTTTdTTddT","id":"naytI0dLM_rK2kaC1m3ZSQ","x":{"gi":[16128671,170080362,238899961,300951199,300959272,301028820,301645939,331641192,386279707,386596461,386612864,386703866,387611185,387620428,388476787,417289371,417611707,417946767,417978415,418959018,419152565,419162934,419813291,422816669,423701439,425114036,425118796,432415615,432562568,432626241,432635968,432659922,432684497,432690586,432703234,432736200,432880152,432953819,442595713,1346374,146551,1651302,1786912,169888197,238861252,260450150,299878182,300314144,300449541,301075798,309700921,315135351,331037990,332342033,342361479,344191916,345366192,359331400,378003303,378016325,383102034,384378189,385153831,385540142,385712793,386123259,386255409,408572530,408573074,430943991,431099801,431165037,431174250,431203285,431224623,431230498,431246724,431286104,431413776,431470315,441604264],"uni":["B1X6M6_ECODH","C4ZWH1_ECOBW","C9R0S4_ECOD1","D8AUM8_ECOLX","D8BAH2_ECOLX","D8C3V6_ECOLX","E1HN87_ECOLX","E3PH88_ECOH1","F4M915_ECOLX","F4SKQ3_ECOLX","F9R4A9_ECOLX","G2ARB9_ECOLX","
G2F842_ECOLX","H0QDP0_ECOLI","H4VCZ9_ECOLX","H4W7B0_ECOLX","H9UPV0_ECOLX","I0ZPA6_ECOLX","I2HYP7_ECOLX","I2PSR3_ECOLX","I2R5Q4_9ESCH","I2ZVL2_ECOLX","I4JCL0_ECOLX","KDPD_ECOLI"]},"s":"MNNEPLRPDPDRLLEQTAAPHRGKLKVFFGACAGVGKTWAMLAEAQRLRAQGLDIVVGVVETHGRKDTAAMLEGLAVLPLKRQAYRGRHISEFDLDAALARRPALILMDELAHSNAPGSRHPKRWQDIEELLEAGIDVFTTVNVQHLESLNDVVSGVTGIQVRETVPDPFFDAADDVVLVDLPPDDLRQRLKEGKVYIAGQAERAIEHFFRKGNLIALRELALRRTADRVDEQMRAWRGHPGEEKVWHTRDAILLCIGHNTGSEKLVRAAARLASRLGSVWHAVYVETPALHRLPEKKRRAILSALRLAQELGAETATLSDPAEEKAVVRYAREHNLGKIILGRPASRRWWRRETFADRLARIAPDLDQVLVALDEPPARTINNAPDNRSFKDKWRVQIQGCVVAAALCAVITLIAMQWLMAFDAANLVMLYLLGVVVVALFYGRWPSVVATVINVVSFDLFFIAPRGTLAVSDVQYLLTFAVMLTVGLVIGNLTAGVRYQARVARYREQRTRHLYEMSKALAVGRSPQDIAATSEQFIASTFHARSQVLLPDDNGKLQPLTHPQGMTPWDDAIAQWSFDKGLPAGAGTDTLPGVPYQILPLKSGEKTYGLVVVEPGNLRQLMIPEQQRLLETFTLLVANALERLTLTASEEQARMASEREQIRNALLAALSHDLRTPLTVLFGQAEILTLDLASEGSPHARQASEIRQHVLNTTRLVNNLLDMARIQSGGFNLKKEWLTLEEVVGSALQMLEPGLSSPINLSLPEPLTLIHVDGPLFERVLINLLENAVKYAGAQAEIGIDAHVEGENLQLDVWDNGPGLPPGQEQTIFDKFARGNKESAVPGVGLGLAICRAIVDVHGGTITAFNRPEGGACFRVTLPQQTAPELEEFHEDM","t":{"segs":[[94,107],[166,186],[266,277],[428,441],[711,722]],"pfam26":[["KdpD",21,230,"..",0.006,2,211,".]",20,230,"..",329.179,4.617e-103,4.617e-99,0.995],["HATPase_c",778,881,"..",0.003,6,110,"..",774,882,"..",84.573,1.307e-26,2.421e-24,0.965],["DUF4118",407,499,"..",9.917,5,103,"..",402,501,"..",54.331,1.598e-18,7.991e-15,0.836],["HisKA",664,730,"..",1.214,2,68,".]",663,730,"..",43.184,6.792e-14,1.887e-11,0.878],["GAF_3",528,644,"..",0.002,2,129,".]",527,644,"..",38.631,4.443e-13,6.347e-10,0.855],["Usp",251,365,"..",0.429,3,133,"..",249,373,"..",21.742,1.263e-07,0.0001149,0.847]],"agfam1":[["HK_CA:13",730,881,"..",1,158,"[]",175.069,5.176e-52],["HK_CA:2",730,881,"..",1,161,"[]",123.442,1.8e-36],["HK_CA:5",737,881,"..",1,144,"[]",108.901,4.29e-32]],"das":[[403,423,411,4.116,0.0006308],[425,445,434,5.243,1.185e-05],[448,464,456,3.252,0.01334],[476,493,485,4.305,0.0003238],[850,851,851,2.544,0.1621]],"pan
ther":[["PTHR24423",22,894,1.3e-119],["PTHR24423:SF357",22,894,1.3e-119]],"gene3d":[["3.40.50.300","P-loop containing nucleotide triphosphate hydrolases",21,229,1.2e-100],["3.40.50.620","Tyrosyl-Transfer RNA Synthetase ; subunit E; domain 1",245,352,4.8e-06],["1.20.120.620","Backbone structure of the membrane domain of e. Coli histidine kinase receptor kdpd;",397,502,5.6e-39],["1.10.287.130","",657,726,6e-15],["3.30.565.10","",732,885,7.8e-38]],"superfam":[["SSF52402","Adenine nucleotide alpha hydrolases-like",248,378,5.2e-06],["SSF55781","GAF domain-like",508,659,2.8e-06],["SSF47384","Homodimeric domain of signal transducing histidine kinase",645,732,1.4e-15],["SSF55874","ATPase domain of HSP90 chaperone/DNA topoisomerase II/histidine kinase",719,893,1.5e-41]],"coils":[[642,662]],"tmhmm":[[399,421],[425,444],[449,471],[476,498]],"smart":[["SM00388","HisKA",663,730,1.4e-13],["SM00387","HATPase_c",773,883,4.9e-33]],"prints":[["PR00344","BCTRLSENSOR",810,824,1.6e-12],["PR00344","BCTRLSENSOR",828,838,1.6e-12],["PR00344","BCTRLSENSOR",843,861,1.6e-12],["PR00344","BCTRLSENSOR",867,880,1.6e-12]],"proscan":[["PS50109","HIS_KIN",670,883,45.15]]}},"code":200}]
Fetch length and Gene3D results for GI identifiers
$ ./sdQuery.pl -f "l,t(gene3d)" gis.txt
[{"query":"300937843","data":{"l":225,"id":"yg8A8H8N-4x1Ezf8WW-YbA","t":{"gene3d":[["3.40.50.2300","",1,122,2.2e-36],["1.10.10.10","winged helix repressor DNA binding domain",126,223,8.1e-28]]}},"code":200},{"query":"1346374","data":{"l":894,"id":"naytI0dLM_rK2kaC1m3ZSQ","t":{"gene3d":[["3.40.50.300","P-loop containing nucleotide triphosphate hydrolases",21,229,1.2e-100],["3.40.50.620","Tyrosyl-Transfer RNA Synthetase ; subunit E; domain 1",245,352,4.8e-06],["1.20.120.620","Backbone structure of the membrane domain of e. Coli histidine kinase receptor kdpd;",397,502,5.6e-39],["1.10.287.130","",657,726,6e-15],["3.30.565.10","",732,885,7.8e-38]]}},"code":200}]
Fetch length, smart, and panther results using FASTA
$ ./sdQuery.pl -f "l,t(smart|panther)" seqs.faa
[{"query":"yg8A8H8N-4x1Ezf8WW-YbA","data":{"l":225,"id":"yg8A8H8N-4x1Ezf8WW-YbA","t":{"smart":[["SM00448","REC",2,112,2.4e-40],["SM00862","Trans_reg_C",147,223,2.1e-22]],"panther":[["PTHR26402",1,225,3.8e-86],["PTHR26402:SF259",1,225,3.8e-86]]}},"header":"accession:NP_415222.1|locus:b0694|genome:Escherichia coli str. K-12 substr. MG1655","code":200},{"query":"naytI0dLM_rK2kaC1m3ZSQ","data":{"l":894,"id":"naytI0dLM_rK2kaC1m3ZSQ","t":{"smart":[["SM00388","HisKA",663,730,1.4e-13],["SM00387","HATPase_c",773,883,4.9e-33]],"panther":[["PTHR24423",22,894,1.3e-119],["PTHR24423:SF357",22,894,1.3e-119]]}},"header":"b0695 [Escherichia coli str. K-12 substr. MG1655]","code":200},{"query":"GS8z3QwN5MzpxU0aTuxuaA","data":{"l":190,"id":"GS8z3QwN5MzpxU0aTuxuaA","t":{"panther":[["PTHR30042",1,190,9.7e-95],["PTHR30042:SF0",1,190,9.7e-95]]}},"header":"KdpC","code":200}]
Fetch length, smart, and panther results using FASTA and convert arrays to hashes
$ ./sdQuery.pl -f "l,t(smart|panther)" -p -a seqs.faa
Timeout error (e.g. the SeqDepot server is down)
$ ./sdQuery.pl gis.txt
Requesting batch from SeqDepot...
Unable to connect to server; timeout or other internal error
Download PNG domain architecture visualizations for a list of GI numbers
$ ./sdQuery.pl -u png gis.txt
// Creates PNG file: 300937843.png
// Creates PNG file: 1346374.png
Download SVG domain architecture visualizations for FASTA sequences
Note: each file name consists of the sequence's Aseq ID because this is the default identifier used when querying with FASTA sequences. See the next example for an alternative naming scheme.
$ ./sdQuery.pl -u svg seqs.faa
// Creates SVG file: yg8A8H8N-4x1Ezf8WW-YbA.svg
// Creates SVG file: naytI0dLM_rK2kaC1m3ZSQ.svg
// Creates SVG file: GS8z3QwN5MzpxU0aTuxuaA.svg
Download PNG domain architecture visualizations for FASTA sequences (using FASTA header as the file name)
$ ./sdQuery.pl -u png -n '${FASTA_HEADER}.png' seqs.faa
// Creates PNG file: b0694.png
// Creates PNG file: b0695.png
// Creates PNG file: KdpC.png
Cross-reference a list of UniProt IDs to PDBs
$ ./sdQuery.pl -u xrefs -x pdb uni.txt
F5CGH2_9HIV1 1dp6,1dp8,1dp9,1drm,1lsv,1lsw,1lsx,1lt0
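For downstream use, the xrefs output can be loaded into a dictionary. The sketch below assumes a single requested database, one query per line, a tab between the query identifier and its cross-references, and commas between the cross-references, as the example output suggests; the function name `parse_xrefs` is hypothetical.

```python
def parse_xrefs(lines):
    """Map each query identifier to its list of cross-referenced identifiers."""
    xrefs = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Assumed layout: "<query>\t<id>,<id>,..."
        query, _, ids = line.partition("\t")
        xrefs[query] = [item for item in ids.split(",") if item]
    return xrefs


output = ["F5CGH2_9HIV1\t1dp6,1dp8,1dp9,1drm,1lsv,1lsw,1lsx,1lt0\n"]
pdb_ids = parse_xrefs(output)
print(len(pdb_ids["F5CGH2_9HIV1"]))  # 8
```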
SeqDepot
We have developed both Perl and Python modules to facilitate many tasks related to interfacing with the SeqDepot server. Additionally, they include a few subroutines for working with sequence data (e.g. parsing FASTA files).
Features
- Find sequences using Aseq, MD5 digests, GI, UniProt, or PDB identifiers
- Retrieve partial or entire records
- Interconvert Aseq IDs and hexadecimal MD5 digests
- Derive Aseq IDs and hexadecimal MD5 digests directly from sequences
- Clean sequences: remove whitespace and replace invalid characters
- Transform precomputed tool data (stored in arrays without field names) to array of hashes (with meaningful column names)
- Validation
- FASTA parser
- Save PNG or SVG visualizations
Download
Perl
- SeqDepot.pm
- Test_SeqDepot.pl (ensures that SeqDepot runs as expected and also contains several simple examples)
Python
- SeqDepot.py (Python 3.x)
- Test_SeqDepot.py (ensures that module runs as expected and also contains several simple examples)
These (and also the Python 2.x module) are also available on the download page.
Requirements
- Internet connection :)
- Perl v5.8 or higher
- The following Perl modules (most are common to a normal Perl installation)
- Carp
- Digest::MD5
- HTTP::Request::Common
- JSON
- LWP::UserAgent
- MIME::Base64
- The Python module has the following dependencies
- json
- hashlib (or md5 if using python 2.x)
- base64
- re
- binascii
- urllib (or urllib2 if using python 2.x)
Subroutines
aseqIdFromMD5Hex
- Description:
- Static method that converts an MD5 hexadecimal string into its Aseq ID equivalent.
- Parameters:
- MD5hex {string} hexadecimal MD5 digest
- Returns:
- Aseq ID {string}
- Example:
Perl:
use SeqDepot;

my $aseq_id = SeqDepot::aseqIdFromMD5Hex('ca0f00f07f0dfb8c751337fc596f986c');
print $aseq_id; # "yg8A8H8N-4x1Ezf8WW-YbA"

Python:
import SeqDepot

aseq_id = SeqDepot.aseqIdFromMD5Hex('ca0f00f07f0dfb8c751337fc596f986c')
print(aseq_id) # "yg8A8H8N-4x1Ezf8WW-YbA"
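For readers who want to see what this conversion involves, here is a standalone sketch that does not use the SeqDepot module. It relies on the fact that an Aseq ID is the URL-safe Base64 form of the raw MD5 bytes with the padding removed; the function names are hypothetical, and the real module may validate its input differently.

```python
import base64
import binascii


def aseq_id_from_md5_hex(md5_hex: str) -> str:
    """Hex digest -> raw bytes -> URL-safe Base64 without '=' padding."""
    raw = binascii.unhexlify(md5_hex)
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


def md5_hex_from_aseq_id(aseq_id: str) -> str:
    """Inverse conversion: restore the two padding characters, then decode."""
    raw = base64.urlsafe_b64decode(aseq_id + "==")
    return raw.hex()


print(aseq_id_from_md5_hex("ca0f00f07f0dfb8c751337fc596f986c"))  # yg8A8H8N-4x1Ezf8WW-YbA
print(md5_hex_from_aseq_id("yg8A8H8N-4x1Ezf8WW-YbA"))  # ca0f00f07f0dfb8c751337fc596f986c
```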
aseqIdFromSequence
- Description:
- Static method for computing the aseqId for a given sequence. It is recommended that all sequences are cleaned before calling this method.
- Parameters:
- sequence {string} ungapped, upper-case amino acid sequence
- Returns:
- Aseq ID {string}
- Example:
Perl:
use SeqDepot;

my $aseq_id = SeqDepot::aseqIdFromSequence('MTNVLIVEDEQAIRRFLRTALEGDGMRVFEAETLQRGLLEAATRKPDLIILDLGLPDGDGIEFIRDLRQWSAVPVIVLSARSEESDKIAALDAGADDYLSKPFGIGELQARLRVALRRHSATTAPDPLVKFSDVTVDLAARVIHRGEEEVHLTPIEFRLLAVLLNNAGKVLTQRQLLNQVWGPNAVEHSHYLRIYMGHLRQKLEQDPARPRHFITETGIGYRFML');
print $aseq_id; # "yg8A8H8N-4x1Ezf8WW-YbA"

Python:
import SeqDepot

aseq_id = SeqDepot.aseqIdFromSequence('MTNVLIVEDEQAIRRFLRTALEGDGMRVFEAETLQRGLLEAATRKPDLIILDLGLPDGDGIEFIRDLRQWSAVPVIVLSARSEESDKIAALDAGADDYLSKPFGIGELQARLRVALRRHSATTAPDPLVKFSDVTVDLAARVIHRGEEEVHLTPIEFRLLAVLLNNAGKVLTQRQLLNQVWGPNAVEHSHYLRIYMGHLRQKLEQDPARPRHFITETGIGYRFML')
print(aseq_id) # "yg8A8H8N-4x1Ezf8WW-YbA"
cleanSequence
- Description:
- Static method that removes all whitespace characters from the sequence and replaces all digits or non-word characters with an at sign (@) for easy identification of invalid symbols.
- Parameters:
- sequence {string}
- Returns:
- {string}
- Example:
Perl:
use SeqDepot;

my $dirtySequence = "M tn\nVLI";
my $cleanSequence = SeqDepot::cleanSequence($dirtySequence);
print $cleanSequence; # "MTNVLI"

# Note: the 9 and - characters will be replaced with at signs (@)
my $sequenceWithInvalidChars = "MTNV 9 L - I";
$cleanSequence = SeqDepot::cleanSequence($sequenceWithInvalidChars);
print $cleanSequence; # "MTNV@L@I"

Python:
import SeqDepot

dirtySequence = "M tn\nVLI"
cleanSequence = SeqDepot.cleanSequence(dirtySequence)
print(cleanSequence) # "MTNVLI"

# Note: the 9 and - characters will be replaced with at signs (@)
sequenceWithInvalidChars = "MTNV 9 L - I"
cleanSequence = SeqDepot.cleanSequence(sequenceWithInvalidChars)
print(cleanSequence) # "MTNV@L@I"
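The cleaning rule can be approximated with two regular-expression substitutions. This standalone sketch is not the module's code: the function name is hypothetical, and the final upper-casing is an assumption inferred from the example output above ("M tn\nVLI" becoming "MTNVLI").

```python
import re


def clean_sequence(sequence: str) -> str:
    """Strip whitespace, flag digits and non-word characters with '@'."""
    without_space = re.sub(r"\s+", "", sequence)
    flagged = re.sub(r"[\d\W]", "@", without_space)
    # Assumption: the result is upper-cased, as the documented example suggests.
    return flagged.upper()


print(clean_sequence("M tn\nVLI"))      # MTNVLI
print(clean_sequence("MTNV 9 L - I"))   # MTNV@L@I
```

Marking invalid symbols instead of silently dropping them keeps residue positions intact and makes problem sequences easy to spot before they are hashed into an Aseq ID.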
find
- Description:
-
Retrieves one or more records from SeqDepot. Unless otherwise specified (see parameters), all fields are returned by default.
Returns a mixed array of hashes or undefs, indicating whether the respective requested Aseq ID was found (undef meaning the requested Aseq ID was not found - not that some other error occurred).
- Parameters:
-
- ids {string | number | array.<string>} one or more sequence identifiers (if multiple, must all be of the same type)
- params {hash} (optional) qualifies the find with the following:
- type {string} identifier type; defaults to aseq_id, but use gi, uni, pdb, or md5_hex for GI, UniProt, PDB, or MD5 hexadecimal identifiers, respectively.
- fields {string} comma-separated field string (see schema); secondary fields must be listed in parentheses, separated with pipe (|) symbols, and immediately suffix the respective primary field name. For example,
s,l                       Returns the sequence and length primary fields
l,t,x                     Returns the length, all tool data, and all cross-references
l,t(pfam26|smart)         Returns the length primary field, and pfam and smart secondary fields of t
l,t(pfam26|smart),x(gi)   Same as the above but include any GI identifiers
- labelToolData {boolean} defaults to false; if true converts any tool data (the t field) into an array of hashes with meaningful field names
- Returns:
{undef | array.<hash | undef>}
On success, returns a mixed array of hashes or undefs. An undef value for the nth element indicates that no Aseq record was found for the nth identifier.
Returns undef if a network error occurs. Call lastError to get the error message.
- Example:
Perl:
use SeqDepot;

# Retrieve all data for a single sequence by its aseq_id
my $sd = new SeqDepot();
my $aseq_id = "naytI0dLM_rK2kaC1m3ZSQ";
my $aseqs = $sd->find($aseq_id);
# [{_id => "naytI0dLM_rK2kaC1m3ZSQ",
#   l => 894,
#   ... }
# ]

# Retrieve the sequence length (l) for 2 GI identifiers and one invalid GI
my $gis = [300937843, 1346374, -2345324];
$aseqs = $sd->find($gis, {type => 'gi', fields => 'l'});
if ($aseqs) {
    # [ {"l":225,"id":"yg8A8H8N-4x1Ezf8WW-YbA"},
    #   {"l":894,"id":"naytI0dLM_rK2kaC1m3ZSQ"},
    #   undef ]
}
else {
    print $sd->lastError();
}

Python:
import SeqDepot

# Retrieve all data for a single sequence by its aseq_id
sd = SeqDepot.new()
aseq_id = "naytI0dLM_rK2kaC1m3ZSQ"
aseqs = sd.find(aseq_id)
# [{'_id': "naytI0dLM_rK2kaC1m3ZSQ",
#   'l': 894,
#   ... }
# ]

# Retrieve the sequence length (l) for 2 GI identifiers and one invalid GI
gis = [300937843, 1346374, -2345324]
aseqs = sd.find(gis, {'type':'gi', 'fields':'l'})
if aseqs:
    # [ {"l":225,"id":"yg8A8H8N-4x1Ezf8WW-YbA"},
    #   {"l":894,"id":"naytI0dLM_rK2kaC1m3ZSQ"},
    #   None ]
    pass
else:
    print(sd.lastError())
findOne
- Description:
- Retrieves a single record from SeqDepot. Unless otherwise specified (see parameters), all fields are returned by default.
- Parameters:
-
- ids {string | number} a single sequence identifier
- params {hash} (optional) see find parameters for details.
- Returns:
- {undef | hash}
- Example:
Perl:
use SeqDepot;

# Retrieve all data for a single sequence by its aseq_id; note that
# unlike find, a hash is returned rather than an array.
my $sd = new SeqDepot();
my $aseq_id = "naytI0dLM_rK2kaC1m3ZSQ";
my $aseq = $sd->findOne($aseq_id);
# {_id => "naytI0dLM_rK2kaC1m3ZSQ",
#  l => 894,
#  ... }

Python:
import SeqDepot

# Retrieve all data for a single sequence by its aseq_id; note that
# unlike find, a dictionary is returned rather than an array.
sd = SeqDepot.new()
aseq_id = "naytI0dLM_rK2kaC1m3ZSQ"
aseq = sd.findOne(aseq_id)
# {'_id' : "naytI0dLM_rK2kaC1m3ZSQ",
#  'l' : 894,
#  ... }
isToolDone
- Description:
- Returns true if the requested tool has been marked as done from the status string. The status string corresponds to the aseqs._s field and contains information about which predictive tools have been executed and whether any results were found with the tool identified by toolId.
- Parameters:
- toolId {string} tool identifier; see the list of valid tool ids
- status {string} status string
- Returns:
- {boolean}
- Example:
Perl:

    use SeqDepot;

    my $sd = new SeqDepot();
    my $aseq_id = "naytI0dLM_rK2kaC1m3ZSQ";
    my $aseq = $sd->findOne($aseq_id);
    # {_id => "naytI0dLM_rK2kaC1m3ZSQ",
    #  l => 894,
    #  _s => "TTTdTdT-TdTTTdTTddT",
    #  ... }

    $sd->isToolDone("pfam26", $aseq->{_s}); # 1 (true)

Python:

    import SeqDepot

    sd = SeqDepot.new()
    aseq_id = "naytI0dLM_rK2kaC1m3ZSQ"
    aseq = sd.findOne(aseq_id)
    # {'_id' : "naytI0dLM_rK2kaC1m3ZSQ",
    #  'l' : 894,
    #  '_s' : "TTTdTdT-TdTTTdTTddT",
    #  ... }

    sd.isToolDone("pfam26", aseq['_s']) # 1 (true)
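To make the status string easier to picture, here is a rough sketch of how such a lookup could work. It rests on assumptions this page does not confirm: that each character position corresponds to one tool in a fixed order, that '-' means the tool has not been run, 'd' means it ran without results, and 'T' means it ran and produced results. The tool order and status string below are hypothetical, and is_tool_done is an illustrative stand-in, not the module's implementation.

```python
# Assumed meaning of each status character (not confirmed by this page)
NOT_RUN = "-"
DONE_NO_RESULTS = "d"
DONE_WITH_RESULTS = "T"


def is_tool_done(tool_id, status, tool_order):
    """Return True if the character for tool_id indicates the tool has been run."""
    if tool_id not in tool_order:
        return False
    position = tool_order.index(tool_id)
    if position >= len(status):
        return False  # status string too short to describe this tool
    return status[position] in (DONE_NO_RESULTS, DONE_WITH_RESULTS)


# Hypothetical three-tool ordering and matching status string
tool_order = ["agfam1", "coils", "das"]
status = "Td-"

for tool_id in tool_order:
    print(tool_id, is_tool_done(tool_id, status, tool_order))
```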
isValidAseqId
- Description:
- Static method that returns true if id is a validly formatted Aseq ID; false otherwise.
- Parameters:
- id {string}
- Returns:
- {boolean}
- Example:
Perl:

    use SeqDepot;

    print SeqDepot::isValidAseqId('yg8A8H8N-4x1Ezf8WW-YbA');  # 1 (true)
    print SeqDepot::isValidAseqId('yg8A8H8N-4x1Ezf8WW-Yb');   # 0 (false)
    print SeqDepot::isValidAseqId('yg8A8H8N-4x1Ezf8WW-YbAA'); # 0 (false)
    print SeqDepot::isValidAseqId(undef);                     # 0 (false)

Python:

    import SeqDepot

    print(SeqDepot.isValidAseqId('yg8A8H8N-4x1Ezf8WW-YbA'))  # 1 (true)
    print(SeqDepot.isValidAseqId('yg8A8H8N-4x1Ezf8WW-Yb'))   # 0 (false)
    print(SeqDepot.isValidAseqId('yg8A8H8N-4x1Ezf8WW-YbAA')) # 0 (false)
    print(SeqDepot.isValidAseqId(None))                      # 0 (false)
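The examples are consistent with an Aseq ID being exactly 22 characters drawn from the URL-safe base64 alphabet (letters, digits, '-' and '_'). A format check under that assumption might look like the sketch below; looks_like_aseq_id is a hypothetical name, and the real isValidAseqId may apply stricter or different rules.

```python
import re

# Assumption: 22 characters from the URL-safe base64 alphabet
ASEQ_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{22}")


def looks_like_aseq_id(candidate):
    """Return True if candidate is a string with the assumed Aseq ID shape."""
    return isinstance(candidate, str) and ASEQ_ID_PATTERN.fullmatch(candidate) is not None


print(looks_like_aseq_id("yg8A8H8N-4x1Ezf8WW-YbA"))   # 22 characters
print(looks_like_aseq_id("yg8A8H8N-4x1Ezf8WW-Yb"))    # 21 characters
print(looks_like_aseq_id("yg8A8H8N-4x1Ezf8WW-YbAA"))  # 23 characters
print(looks_like_aseq_id(None))
```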
isValidFieldString
- Description:
- Static method that returns true if fields is validly formatted; false otherwise.
- Parameters:
- fields {string} comma- and pipe-separated field string; see find parameters for details
- Returns:
- {boolean}
- Example:
Perl:

    use SeqDepot;

    print SeqDepot::isValidFieldString('l,s,_s'); # 1 (true)
    print SeqDepot::isValidFieldString('');       # 0 (false)

    # Both of the following return 1 (true)
    print SeqDepot::isValidFieldString('t(pfam26|das|hamap),x(uni)');
    print SeqDepot::isValidFieldString('x(my_db)');

Python:

    import SeqDepot

    print(SeqDepot.isValidFieldString('l,s,_s')) # 1 (true)
    print(SeqDepot.isValidFieldString(''))       # 0 (false)

    # Both of the following return 1 (true)
    print(SeqDepot.isValidFieldString('t(pfam26|das|hamap),x(uni)'))
    print(SeqDepot.isValidFieldString('x(my_db)'))
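Judging from these examples, a field string is a comma-separated list of field names, each optionally followed by a parenthesized, pipe-separated list of subfields. The sketch below encodes that reading as a regular expression. The grammar, including the assumption that names consist only of letters, digits and underscores, is inferred from the examples and may not match the module's actual rules; field_string_ok is a hypothetical name.

```python
import re

NAME = r"[A-Za-z0-9_]+"
# One field: a name, optionally followed by (sub1|sub2|...)
FIELD = NAME + r"(?:\(" + NAME + r"(?:\|" + NAME + r")*\))?"
FIELD_STRING_PATTERN = re.compile(FIELD + r"(?:," + FIELD + r")*")


def field_string_ok(fields):
    """Return True if fields matches the assumed field-string grammar."""
    return isinstance(fields, str) and FIELD_STRING_PATTERN.fullmatch(fields) is not None


for candidate in ["l,s,_s", "", "t(pfam26|das|hamap),x(uni)", "x(my_db)", "l,,s"]:
    print(repr(candidate), field_string_ok(candidate))
```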
lastError
- Description:
- Returns the error from the last find operation, or undef if no error occurred.
- Parameters:
- None
- Returns:
- {string | undef}
- Example:
Perl:

    use SeqDepot;

    my $sd = new SeqDepot();
    my $aseqs = $sd->find(...);
    if (!$aseqs) {
        # Uh oh, an error occurred, inform user.
        print $sd->lastError();
    }

Python:

    import SeqDepot

    sd = SeqDepot.new()
    aseqs = sd.find(...)
    if not aseqs:
        # Uh oh, an error occurred, inform user.
        print(sd.lastError())
MD5HexFromAseqId
- Description:
- Static method that returns the equivalent MD5 hexadecimal representation of aseqId.
- Parameters:
- aseqId {string}
- Returns:
- {string}
- Example:
Perl:

    use SeqDepot;

    # Prints "ca0f00f07f0dfb8c751337fc596f986c"
    print SeqDepot::MD5HexFromAseqId('yg8A8H8N-4x1Ezf8WW-YbA');

Python:

    import SeqDepot

    # Prints "ca0f00f07f0dfb8c751337fc596f986c"
    print(SeqDepot.MD5HexFromAseqId('yg8A8H8N-4x1Ezf8WW-YbA'))
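The example pair above is consistent with an Aseq ID being the URL-safe base64 encoding of the 16-byte MD5 digest with the trailing '=' padding removed. Under that assumption, the conversion can be sketched with the Python standard library alone. The two function names are hypothetical and are not the module's own implementation.

```python
import base64


def md5_hex_from_aseq_id(aseq_id):
    """Decode an unpadded URL-safe base64 Aseq ID into a hexadecimal MD5 string."""
    padded = aseq_id + "=" * (-len(aseq_id) % 4)  # restore the stripped padding
    return base64.urlsafe_b64decode(padded).hex()


def aseq_id_from_md5_hex(md5_hex):
    """Encode a hexadecimal MD5 string as an unpadded URL-safe base64 Aseq ID."""
    digest = bytes.fromhex(md5_hex)
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


print(md5_hex_from_aseq_id("yg8A8H8N-4x1Ezf8WW-YbA"))
# ca0f00f07f0dfb8c751337fc596f986c
print(aseq_id_from_md5_hex("ca0f00f07f0dfb8c751337fc596f986c"))
# yg8A8H8N-4x1Ezf8WW-YbA
```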
MD5HexFromSequence
- Description:
- Static method for computing the hexadecimal MD5 digest from sequence. It is recommended to clean the sequence before calling this method.
- Parameters:
- sequence {string}
- Returns:
- {string}
- Example:
Perl:

    use SeqDepot;

    my $sequence = "MTNVLIVEDEQAIRRFLRTALEGDGMRVFEAETLQRGLLEAATRKPDLIILDLGLPDGDGIEFIRDLRQWSAVPVIVLSARSEESDKIAALDAGADDYLSKPFGIGELQARLRVALRRHSATTAPDPLVKFSDVTVDLAARVIHRGEEEVHLTPIEFRLLAVLLNNAGKVLTQRQLLNQVWGPNAVEHSHYLRIYMGHLRQKLEQDPARPRHFITETGIGYRFML";

    # Prints "ca0f00f07f0dfb8c751337fc596f986c"
    print SeqDepot::MD5HexFromSequence($sequence);

Python:

    import SeqDepot

    sequence = "MTNVLIVEDEQAIRRFLRTALEGDGMRVFEAETLQRGLLEAATRKPDLIILDLGLPDGDGIEFIRDLRQWSAVPVIVLSARSEESDKIAALDAGADDYLSKPFGIGELQARLRVALRRHSATTAPDPLVKFSDVTVDLAARVIHRGEEEVHLTPIEFRLLAVLLNNAGKVLTQRQLLNQVWGPNAVEHSHYLRIYMGHLRQKLEQDPARPRHFITETGIGYRFML"

    # Prints "ca0f00f07f0dfb8c751337fc596f986c"
    print(SeqDepot.MD5HexFromSequence(sequence))
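The recommendation to clean the sequence first matters because an MD5 digest changes with any stray character, so two copies of the same protein that differ only in line breaks or letter case would otherwise hash differently. The sketch below shows the idea with hashlib. The cleaning rule used here (remove all whitespace, convert to upper case) is an assumption and may differ from what SeqDepot actually does; both function names are hypothetical.

```python
import hashlib


def clean_sequence(sequence):
    """Assumed cleaning rule: drop all whitespace and convert to upper case."""
    return "".join(sequence.split()).upper()


def md5_hex_from_sequence(sequence):
    """Return the hexadecimal MD5 digest of the cleaned sequence."""
    cleaned = clean_sequence(sequence)
    return hashlib.md5(cleaned.encode("ascii")).hexdigest()


raw = "mtnvlivede\nQAIRRFLRTA \n"
tidy = "MTNVLIVEDEQAIRRFLRTA"

# Both spellings of the same sequence produce the same digest after cleaning
print(md5_hex_from_sequence(raw) == md5_hex_from_sequence(tidy))
```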
primeFastaBuffer
- Description:
- Sets the internal FASTA parsing buffer to fastaBuffer. This is useful when an input stream has already been partially read but not processed as part of the FASTA parsing. For example, when reading a line from STDIN to determine if it is FASTA data.
- Parameters:
- fastaBuffer {string}
- Returns:
- None
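To illustrate why a parsing buffer is needed at all, the following self-contained sketch implements a tiny FASTA reader with the same idea: a line that has already been consumed from the stream (for instance, to check whether the input starts with ">") is handed back to the parser so that the first record is not lost. The FastaReader class and its prime and read methods are hypothetical stand-ins for illustration, not the SeqDepot implementation.

```python
import io


class FastaReader:
    def __init__(self):
        self._buffer = ""  # a line already consumed from the stream but not yet parsed

    def prime(self, text):
        """Hand a previously read line back to the parser."""
        self._buffer = text

    def read(self, handle):
        """Return [header, sequence] for the next record, or None at end of input."""
        line = self._buffer or handle.readline()
        self._buffer = ""
        while line and not line.startswith(">"):
            line = handle.readline()  # skip anything before the next header
        if not line:
            return None
        header = line[1:].strip()
        chunks = []
        while True:
            line = handle.readline()
            if not line:
                break
            if line.startswith(">"):
                self._buffer = line  # this header belongs to the next record
                break
            chunks.append("".join(line.split()))
        return [header, "".join(chunks)]


stream = io.StringIO(">seq1 first protein \nMTNV\nLIVE\n>seq2\nDEQA\n")
first_line = stream.readline()  # peek at the input to decide whether it is FASTA

reader = FastaReader()
if first_line.startswith(">"):
    reader.prime(first_line)  # without this, seq1 would be skipped

records = []
while True:
    record = reader.read(stream)
    if record is None:
        break
    records.append(record)
print(records)
```

The same sketch shows why a buffer may need clearing: after a record is read, the header of the next record is left sitting in the buffer, so switching to a different stream mid-file would otherwise start with a stale line.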
readFastaSequence
- Description:
Reads a FASTA-formatted sequence from an open file handle and returns an array containing the header and the cleaned sequence. The header will not contain the > symbol. Returns undef if there are no more sequences to be read from the file handle.
Whitespace is trimmed from both ends of the header line.
- Parameters:
- fileHandle {open file handle}
- Returns:
- undef if end-of-file has been reached; otherwise, a 2-element array containing the header and cleaned sequence
- Example:
Perl:

    use SeqDepot;

    my $sd = new SeqDepot();

    my $file = shift or die qq(Please provide a FASTA file\n);
    open (IN, '<', $file) or die qq(Unable to open file, $file: $!\n);
    while (my $seq = $sd->readFastaSequence(*IN)) {
        # $seq is:
        # ["Header", "MTNVLIVEDEQAIR..."]
        print "Read one sequence\n";
        print "Header: $seq->[0]\n";
        print "Clean sequence: $seq->[1]\n";
        # ...
    }
    close (IN);

Python:

    import SeqDepot

    sd = SeqDepot.new()

    # Keep prompting until a readable file name is supplied
    while True:
        file = input("Please provide a FASTA file: ")
        try:
            IN = open(file, 'r')
            break
        except OSError:
            print("Unable to open file, " + file)

    seq = sd.readFastaSequence(IN)
    while seq:
        # seq is:
        # ["Header", "MTNVLIVEDEQAIR..."]
        print("\nRead one sequence")
        print("Header: " + seq[0])
        print("Clean sequence: " + seq[1])
        # ...
        seq = sd.readFastaSequence(IN)
    IN.close()
resetFastaBuffer
- Description:
Clears the internal buffer used to read FASTA sequences. Call this method before readFastaSequence if all of the following are true:
- you are switching to a different filehandle,
- the previous filehandle has been partially read from, and
- the previous filehandle has not been completely read through to the end.
- Parameters:
- None
- Returns:
- None
saveImage
- Description:
- Saves an image of the Aseq record for id.
- Parameters:
- id {string | number} Aseq ID | GI | UniProt ID | PDB ID | MD5 hex; if other than an Aseq ID, the type must be specified in the params argument.
- fileName {string} (optional) name of the file to save the image to; defaults to id with the appropriate file extension
- params {hash} (optional) qualifies the query with the following:
  - type {string} identifier type; see find parameters for details
  - fields {string} comma-separated field string; see find parameters for details
  - format {string} type of image to save; only png and svg are supported; defaults to png
- Returns:
- None
- Example:
Perl:

    use SeqDepot;

    my $sd = new SeqDepot();
    my $gi = 3355692;

    # Save domain architecture image to the current directory
    # as "3355692.png"
    $sd->saveImage($gi, undef, {type => 'gi'});

    # Save svg with custom filename: "cher.svg"
    $sd->saveImage('CHER_ECOLI', 'cher.svg', {type => 'uni', format => 'svg'});

Python:

    import SeqDepot

    sd = SeqDepot.new()
    gi = 3355692

    # Save domain architecture image to the current directory
    # as "3355692.png"
    sd.saveImage(gi, None, {'type' : 'gi'})

    # Save svg with custom filename: "cher.svg"
    sd.saveImage('CHER_ECOLI', 'cher.svg', {'type' : 'uni', 'format' : 'svg'})
toolFields
- Description:
- Returns the field names associated with toolId, or null if an error occurs.
- Parameters:
- toolId {string} tool identifier; see the list of valid tool ids
- Returns:
- {array.<string> | null}
- Example:
Perl:

    use SeqDepot;

    my $sd = new SeqDepot();
    my $fieldNames = $sd->toolFields('das');
    if ($fieldNames) {
        foreach my $fieldName (@$fieldNames) {
            print "Field: $fieldName\n";
            # Prints
            # start
            # stop
            # peak
            # peak_score
            # evalue
        }
    }

Python:

    import SeqDepot

    sd = SeqDepot.new()
    fieldNames = sd.toolFields('das')
    if fieldNames:
        for fieldName in fieldNames:
            print("Field: " + fieldName)
            # Prints
            # start
            # stop
            # peak
            # peak_score
            # evalue
toolNames
- Description:
- Returns an ordered array of all valid tool identifiers on success, or null if an error occurred.
- Parameters:
- None
- Returns:
- {array.<string> | null}
- Example:
Perl:

    use SeqDepot;

    my $sd = new SeqDepot();
    my $toolIds = $sd->toolNames();
    if ($toolIds) {
        foreach my $toolId (@$toolIds) {
            print "Field: $toolId\n";
            # Prints
            # agfam1
            # coils
            # ...
        }
    }

Python:

    import SeqDepot

    sd = SeqDepot.new()
    toolIds = sd.toolNames()
    if toolIds:
        for toolId in toolIds:
            print("Field: " + toolId)
            # Prints
            # agfam1
            # coils
            # ...
tools
- Description:
- Returns a hash of the tools available in SeqDepot and their associated fields on success, or null otherwise. Each hash key is a tool identifier and its value is the array of that tool's field names.
- Parameters:
- None
- Returns:
- {hash | null}
- Example:
Perl:

    use SeqDepot;

    my $sd = new SeqDepot();
    my $tools = $sd->tools();
    if ($tools) {
        foreach my $toolId (keys %$tools) {
            my @fields = @{$tools->{$toolId}};
            print "Id: $toolId -> ", join(', ', @fields), "\n";
            # Prints
            # agfam1 -> name, start, stop, extent, hmm_start, hmm_stop, hmm_extent, score, evalue
            # coils -> start, stop
            # ...
        }
    }

Python:

    import SeqDepot

    sd = SeqDepot.new()
    tools = sd.tools()
    if tools:
        for toolId in tools.keys():
            fields = tools[toolId]
            print("Id: " + toolId + " -> " + ', '.join(fields))
            # Prints
            # agfam1 -> name, start, stop, extent, hmm_start, hmm_stop, hmm_extent, score, evalue
            # coils -> start, stop
            # ...
Python
Davi Ortega has kindly ported the Perl module described here to the Python programming language. This module adheres to the Perl interface documented on this page (excepting syntax constructs and differences specific to each language).
You may retrieve the Python module from the download page.
Citation
The SeqDepot database has been published in the Bioinformatics journal. Please cite the following manuscript if you use SeqDepot in your research:
SeqDepot: streamlined database of biological sequences and precomputed features.
Ulrich, L.E. and Zhulin, I.B. Bioinformatics (2014).
Acknowledgments
We thank the following individuals / organizations:
- Davi Ortega for assisting with server configuration and administration
- University of Tennessee for hosting the SeqDepot database and website and providing access time to the Newton compute cluster
- SIMAP and InterPro for providing much of the precomputed data inside SeqDepot