Another frequent situation arises when we BLAST protein families that we are less familiar with, but want only a limited set of sequences that meet some general goals we have in mind: what we find in the output determines what we decide to retrieve. In such cases, the information carried through into the output is essential for rapidly assessing what the output set contains. Unfortunately, the uninformative BLAST outputs for databases like SPTREMBL are a scourge of users.
So, to get to the point, the key features of BLAST2SRS are
This list illustrates some typical selection criteria where BLAST2SRS is an appropriate tool:
To learn more about BLAST and SRS, follow the links to read the documentation provided by the developers of these tools.
Next, review the search parameters. The most important is the choice of sequence database to search against. All other parameters can be left at their defaults, although they can strongly affect the results under particular circumstances, especially the SEG filter for regions of reduced sequence complexity. This applies to all BLAST servers; if you are not familiar with these parameters, read up on them on the BLAST home page at the NCBI. Select SWISSALL for the TFIIB query.
UniProt-TrEMBL Release 33.0 May-2006 contained 2,948,323 entries.
Automatically annotated entries that carry through translational annotation from the EMBL and other DNA sequence databases, together with automated annotation and classification. A much larger dataset than SWISS-PROT, but with very limited and cryptic annotation. Entries are gradually migrated by hand into SWISS-PROT. Full updates are released once or twice per year, with regular small updates in between.
UniProt
The combined database including both SWISS-PROT and TrEMBL. A comprehensive collection of protein sequences in the general public databases. Select this database to search all sequences that have so far been incorporated.
(There may be specialised databases on-line with useful sequences that are not yet incorporated, e.g. for organism-specific genome projects.)
Now click the [Run Blast] box to launch the job. The time the server takes to return output depends on the length of the sequence and the number of hits. Usually it takes a minute, but do not panic if it takes considerably longer. You can bookmark the output page, which is recommended for serious work.
The TF2B_example finds about 70 sequences with good E-value in SWISSALL as of (2.03).
In the header the following options are available.
Species list
The list of species found in the hits helps you check if your favourites are in there. Tick boxes allow just the desired species to be retrieved.
Exclude fragments
Unfortunately many sequences have entered the database as fragments. Where possible SWISS-PROT databases identify these. This is important because fragments can confound every type of sequence analysis and inference from sequence for the unwary. By default these are excluded from retrieval: only retrieve them with caution.
Cutoff value
This uses the Expect or E-value significance assessment provided by BLAST: essentially the number of hits expected to occur by chance with that score or better. The default of e-3 is usually safe for globular protein homologues but, if the SEG filter is turned off and the query has reduced sequence complexity, it becomes highly unreliable.
Update Sequence Selection
After modifying other parameters like species selection, this button updates the selected entries.
Invert Selection
Inverts the selection, so that previously selected sequences are excluded and previously unselected sequences are selected.
Get Sequences in FASTA format
Collects the selected sequences in FASTA format, the most widely used flat file format for loading into other software such as ClustalX for multiple sequence alignment. The parser attempts to provide informative names for the retrieved sequences. The name >Q98S45_GUITH_TFIIB in the example below is generated by concatenating the entry name, five characters of the species (Guillardia theta) and the gene name (if known). This makes it possible to keep some overview of what the sequences are during subsequent manipulations.
This option will be slow if there are many sequences to collect as it currently retrieves the information from the large database flat file (> 1 million entries for SWISSALL).
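The way the header options combine can be sketched in a few lines of Python. This is only an illustration of the logic described above, not the BLAST2SRS implementation: the `Hit` record, the function names and the hit data are all hypothetical, and the sketch assumes that the default cutoff "e-3" means 1e-3 and that a hit passes when its E-value is at or below the cutoff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one BLAST hit (not a BLAST2SRS data structure)."""
    name: str
    species: str
    evalue: float
    is_fragment: bool


def select_hits(hits, species=None, cutoff=1e-3, exclude_fragments=True):
    """Return the names of hits that pass the species, fragment and cutoff options."""
    selected = set()
    for hit in hits:
        if hit.evalue > cutoff:                    # Cutoff value (assumed: keep E <= cutoff)
            continue
        if exclude_fragments and hit.is_fragment:  # Exclude fragments
            continue
        if species is not None and hit.species not in species:  # Species tick boxes
            continue
        selected.add(hit.name)
    return selected


def invert_selection(hits, selected):
    """Swap the selection: everything not selected before is selected now."""
    return {hit.name for hit in hits} - selected


# Made-up hits, for illustration only.
hits = [
    Hit("HIT_A", "Arabidopsis thaliana", 1e-40, False),
    Hit("HIT_B", "Guillardia theta", 1e-25, False),
    Hit("HIT_C", "Arabidopsis thaliana", 1e-30, True),   # flagged as a fragment
    Hit("HIT_D", "Guillardia theta", 0.5, False),         # fails the cutoff
]

chosen = select_hits(hits, species={"Arabidopsis thaliana"})
print(sorted(chosen))                          # ['HIT_A']
print(sorted(invert_selection(hits, chosen)))  # ['HIT_B', 'HIT_C', 'HIT_D']
```

Note that in this sketch the inverted set also contains hits that failed the cutoff or were fragments; whether the server treats them the same way is not stated here.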
Two of the hits from the TFIIB example retrieved in FASTA format:
>Q98S45_GUITH_TFIIB ::Guillardia theta (Cryptomonas phi)::TFIIB.::Transcription initiation factor IIB.
MNTCINCGSKRFLEDYKQGDIICKNCGFIIESHIIDFGSEWRIFSDDNRSNNPVRIGLPE
NPLLGNSSSTLISKGLKGSNKINEKLLKAQNQNDNCKSEKYLASVFSIISFFLTNGSFSK
LIKEKVQELFKNYYDYLTLKSNGSRIKTTLRKKDTFSIIAASIFIICKNESIPRSFKEIS
ELTKVKKKDIGNRVRIMEKALEGIKISKKRDSDNFISRFCSKLGLSSTSSKIAEQIANFI
KDKEGMYGRNYISVAAASIYVVSQIPNLSNNCNLKKIIEATGVSEITLRSAYKAMYPYRK
EILLKIKNKESLICNSVFSNLTITN
>Q9LIA6_ARATH_NONE ::Arabidopsis thaliana (Mouse-ear cress) ::none ::Transcription initiation factor IIB (TFIIB)-like protein.
MEEETCLDCKRPTIMVVDHSSGDTICSECGLVLEAHIIEYSQEWRTFASDDNHSDRDPNR
VGAATNPFLKSGDLVTIIEKPKETASSVLSKDDISTLFRAHNQVKNHEEDLIKQAFEEIQ
RMTDALDLDIVINSRACEIVSKYDGHANTKLRRGKKLNAICAASVSTACRELQLSRTLKE
IAEVANGVDKKDIRKESLVIKRVLESHQTSVSASQAIINTGELVRRFCSKLDISQREIMA
IPEAVEKAENFDIRRNPKSVLAAIIFMISHISQTNRKPIREIGIVAEVVENTIKNSVKDM
YPYALKIIPNWYACESDIIKRLDGVITSWDSAKFSV
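For downstream scripting, the naming scheme and the `::`-separated header fields seen in the two records above can be reproduced roughly as follows. This is a sketch under stated assumptions, not the server's parser: the function names are hypothetical, and the five-character species code is assumed to be the first three letters of the genus plus the first two of the species epithet, which happens to reproduce GUITH and ARATH but will not hold for every organism.

```python
import re


def species_code(species):
    """Assumed rule: 3 letters of the genus + 2 of the species epithet, upper-cased."""
    words = re.findall(r"[A-Za-z]+", species)
    genus, epithet = words[0], words[1]
    return (genus[:3] + epithet[:2]).upper()


def fasta_name(entry, species, gene=None):
    """Concatenate entry name, species code and gene name ('NONE' if unknown)."""
    gene_part = (gene or "none").strip().rstrip(".").upper()
    return "_".join([entry, species_code(species), gene_part])


def parse_header(line):
    """Split a header of the form '>name ::species::gene::description'."""
    fields = [f.strip() for f in line.lstrip(">").split("::", 3)]
    name, species, gene, description = fields
    return {"name": name, "species": species, "gene": gene, "description": description}


header = (">Q98S45_GUITH_TFIIB ::Guillardia theta (Cryptomonas phi)"
          "::TFIIB.::Transcription initiation factor IIB.")
record = parse_header(header)

print(record["species"])                                          # Guillardia theta (Cryptomonas phi)
print(fasta_name("Q98S45", record["species"], record["gene"]))    # Q98S45_GUITH_TFIIB
print(fasta_name("Q9LIA6", "Arabidopsis thaliana (Mouse-ear cress)"))  # Q9LIA6_ARATH_NONE
```

Keeping the species and gene in the sequence name means they survive tools that truncate the header at the first space.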
Try out these SRS keyword examples for the TFIIB query
archaea
archaea & pyrococcus
archaea ! pyrococcus
bacteria
metazoa
metazoa | fungi
fungi & BRF1
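The operators in these keyword examples behave like set operations on the hits, assuming the usual SRS reading of `&` as AND, `|` as OR and `!` as BUT NOT. The following toy evaluator illustrates that reading only; the `evaluate` function, the keyword index and the hit identifiers are all invented for the example, and it evaluates strictly left to right without brackets or operator precedence.

```python
def evaluate(query, index):
    """Evaluate 'word (op word)*' left to right against a keyword -> hits index."""
    tokens = query.split()
    result = set(index.get(tokens[0].lower(), set()))
    for op, word in zip(tokens[1::2], tokens[2::2]):
        other = index.get(word.lower(), set())
        if op == "&":      # AND: hits matching both keywords
            result &= other
        elif op == "|":    # OR: hits matching either keyword
            result |= other
        elif op == "!":    # BUT NOT: drop hits matching the second keyword
            result -= other
        else:
            raise ValueError(f"unknown operator: {op}")
    return result


# Invented keyword index, for illustration only.
index = {
    "archaea": {"hit1", "hit2", "hit3"},
    "pyrococcus": {"hit1"},
    "metazoa": {"hit4", "hit5"},
    "fungi": {"hit6", "hit7"},
    "brf1": {"hit5", "hit7"},
}

print(sorted(evaluate("archaea & pyrococcus", index)))  # ['hit1']
print(sorted(evaluate("archaea ! pyrococcus", index)))  # ['hit2', 'hit3']
print(sorted(evaluate("metazoa | fungi", index)))       # ['hit4', 'hit5', 'hit6', 'hit7']
print(sorted(evaluate("fungi & BRF1", index)))          # ['hit7']
```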