
Help for the BLAST2SRS sequence retrieval server

Why BLAST2SRS?

Why have we gone to the trouble of providing yet another BLAST-based web server when many are already available? We wanted to retrieve homologous sequences in an easy-to-use and flexible way that other servers do not provide. Many researchers, ourselves included, revisit the same protein family many times with BLAST. During a project the searches may have different goals, e.g. checking for new entries or retrieving entries from a particular species. We therefore want to retrieve sequences efficiently according to whatever classification is of interest at the moment, while ignoring all the other sequences that have been found.

Another frequent situation arises when we BLAST a protein family that we are less familiar with, looking for a limited set of sequences that match some general goals: what we find in the output will affect what we decide to retrieve. In such cases, information carried through into the output is essential for rapidly assessing what the output set consists of. Unfortunately, the uninformative BLAST outputs for databases like SPTREMBL are a scourge of users.

So, to get to the point, the key features of BLAST2SRS are:

  1. Bringing critical information into the BLAST output to inform the user
  2. Providing flexible SRS-based controls in the output so that the user can collect sequences by combining keywords with the BLAST output information.

About BLAST2SRS

BLAST2SRS is a BLAST server that calls SRS, the powerful online Sequence Retrieval System, to collect protein sequence entries from the indexed databases. It supports only the SWISS-PROT family of databases, because we need consistent database formatting and SWISS-PROT itself is the most feature-rich protein sequence database.

This list illustrates some typical selection criteria where BLAST2SRS is an appropriate tool:

To learn more about BLAST and SRS, follow the links to read the documentation provided by the developers of these tools.




A tutorial for BLAST2SRS

Setting up the query in BLAST2SRS

As an example, collect the entry TF2B_HUMAN from SWISS-PROT. You can get it from the EMBL SRS server or the EBI SRS server. Cut and paste just the sequence, without any header or annotation lines, into the sequence box.
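If the entry was saved with header lines or with residues grouped in spaced blocks, the bare sequence can be extracted before pasting. The following is only a minimal sketch: the function name bare_sequence and the sample record are hypothetical, and it assumes the pasted text contains only sequence lines and, optionally, FASTA-style header lines starting with ">".

```python
def bare_sequence(pasted_text: str) -> str:
    """Return only the residue letters from pasted sequence text.

    Assumption: lines starting with '>' are headers; every other line
    holds residues, possibly mixed with spaces or position numbers.
    """
    residues = []
    for line in pasted_text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and header lines
        residues.extend(ch for ch in line if ch.isalpha())
    return "".join(residues).upper()


# Hypothetical sample record, for illustration only.
sample = """>example header (hypothetical)
MNTCINCGSK RFLEDYKQGD   20
"""
print(bare_sequence(sample))
```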

Next, review the search parameters. The most important step is to select an appropriate database of sequences to search against. All other parameters can be left at their defaults, although they can strongly affect the results under particular circumstances, especially SEG, the filter for regions of reduced sequence complexity. This applies to all BLAST servers; if you are not familiar with these parameters, read up on them at the BLAST home page at the NCBI. Select SWISSALL for the TFIIB query.

Databases available

UniProt-Swiss-Prot Release 50.7 Sep-2006 contained 232,345 entries.
The best-annotated protein database, but limited in sequence coverage. Choose it for testing purposes or for browsing quality annotation. Major updates are released once or twice per year, with regular small updates in between.

UniProt-TrEMBL Release 33.0 May-2006 contained 2,948,323 entries.
Automatically annotated entries that carry through translational annotation from the EMBL and other DNA sequence databases, together with automated annotation and classification. A much larger dataset than SWISS-PROT, but with very limited and cryptic annotation. Entries are gradually migrated by hand into SWISS-PROT. Full updates are released once or twice per year, with regular small updates in between.

UniProt
The combined database including both SWISS-PROT and TrEMBL. A comprehensive collection of protein sequences in the general public databases. Select this database to search all sequences that have so far been incorporated.

(There may be specialised databases on-line with useful sequences that are not yet incorporated, e.g. for organism-specific genome projects.)



Now click the [Run Blast] box to launch the job. The time the server takes to return output depends on the length of the sequence and the number of hits. Usually it takes about a minute, but do not panic if it takes considerably longer. You can bookmark the output page, which is recommended for serious work.

The BLAST2SRS output

The output page looks much like other BLAST outputs but has a few additions that are currently not provided elsewhere, certainly not all together. All of these additions are aimed at helping you retrieve sequences of interest and avoid those that are not wanted. Note that the hit table provides the species, gene name and fragment identifier for each entry. These are not usually shown to the user, yet they are key data for evaluating entries.

The TF2B example finds about 70 sequences with a good E-value in SWISSALL as of (2.03).

In the header the following options are available.

Species list
The list of species found in the hits helps you check if your favourites are in there. Tick boxes allow just the desired species to be retrieved.

Exclude fragments
Unfortunately, many sequences have entered the database as fragments. Where possible, the SWISS-PROT databases identify these. This is important because fragments can confound every type of sequence analysis, and any inference from sequence, for the unwary. By default fragments are excluded from retrieval: retrieve them only with caution.

Cutoff value
This uses the Expect or E-value significance assessment provided by BLAST: essentially the number of false positive sequences expected to occur by chance for that score. The default of e-3 is usually safe for globular protein homologues but, if the SEG filter is turned off and the query has reduced sequence complexity, it becomes highly unreliable.

Update Sequence Selection
After modifying other parameters like species selection, this button updates the selected entries.

Invert Selection
Inverts the selection, so that the previously selected sequences are now excluded and the previously unselected sequences are selected.
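The way the header options combine can be pictured with a small sketch. This is not the server's actual code: the Hit class, the function names and the sample hits (including their E-values) are invented for illustration. It assumes that a hit is selected only if it passes the species filter, the fragment filter and the E-value cutoff together.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    entry: str          # entry name (hypothetical values below)
    species: str
    evalue: float       # BLAST Expect value
    is_fragment: bool


def select_hits(hits, species=None, exclude_fragments=True, cutoff=1e-3):
    """Return the set of entry names that pass all three filters."""
    selected = set()
    for hit in hits:
        if species is not None and hit.species not in species:
            continue  # species tick box not set
        if exclude_fragments and hit.is_fragment:
            continue  # fragments are excluded by default
        if hit.evalue > cutoff:
            continue  # not significant at the chosen cutoff
        selected.add(hit.entry)
    return selected


def invert_selection(hits, selected):
    """Select exactly those hits that were not selected before."""
    return {hit.entry for hit in hits} - selected


# Made-up hits, for illustration only.
hits = [
    Hit("ENTRY_A", "Homo sapiens", 1e-50, False),
    Hit("ENTRY_B", "Homo sapiens", 1e-20, True),
    Hit("ENTRY_C", "Arabidopsis thaliana", 1e-10, False),
    Hit("ENTRY_D", "Arabidopsis thaliana", 0.5, False),
]

default_selection = select_hits(hits)
print(sorted(default_selection))
print(sorted(invert_selection(hits, default_selection)))
```

With the defaults, ENTRY_B is dropped as a fragment and ENTRY_D fails the cutoff; inverting the selection returns exactly those two.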

Get Sequences in FASTA format
Collects the selected sequences in FASTA format, the most widely used flat-file format for loading into other software such as ClustalX for multiple sequence alignment. The parser attempts to provide informative names for the retrieved sequences. The name >Q98S45_GUITH_TFIIB in the example below is generated by concatenating the entry name, five characters for the species (Guillardia theta) and the gene name (if known). This makes it possible to keep track of what the sequences are during subsequent manipulations.

This option will be slow if there are many sequences to collect as it currently retrieves the information from the large database flat file (> 1 million entries for SWISSALL).
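The naming scheme described for the FASTA option can be sketched as follows. This is a guess at the behaviour rather than the server's parser: the function name fasta_name is hypothetical, the five-character species code is taken as an input instead of being derived, and the NONE placeholder and the removal of a trailing full stop from the gene name are inferred from the two example names in this help page.

```python
def fasta_name(entry_name, species_code, gene_name=None):
    """Build a FASTA name such as >Q98S45_GUITH_TFIIB.

    Assumptions: a missing gene name becomes the placeholder NONE, and
    a trailing full stop on the gene name is dropped.
    """
    gene = (gene_name or "").strip().rstrip(".")
    if not gene or gene.lower() == "none":
        gene = "NONE"
    return ">" + "_".join([entry_name, species_code, gene])


print(fasta_name("Q98S45", "GUITH", "TFIIB."))
print(fasta_name("Q9LIA6", "ARATH"))
```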

Two of the hits from the TFIIB example retrieved in FASTA format:

>Q98S45_GUITH_TFIIB ::Guillardia theta (Cryptomonas phi)::TFIIB.::Transcription initiation factor IIB.
MNTCINCGSKRFLEDYKQGDIICKNCGFIIESHIIDFGSEWRIFSDDNRSNNPVRIGLPE
NPLLGNSSSTLISKGLKGSNKINEKLLKAQNQNDNCKSEKYLASVFSIISFFLTNGSFSK
LIKEKVQELFKNYYDYLTLKSNGSRIKTTLRKKDTFSIIAASIFIICKNESIPRSFKEIS
ELTKVKKKDIGNRVRIMEKALEGIKISKKRDSDNFISRFCSKLGLSSTSSKIAEQIANFI
KDKEGMYGRNYISVAAASIYVVSQIPNLSNNCNLKKIIEATGVSEITLRSAYKAMYPYRK
EILLKIKNKESLICNSVFSNLTITN

>Q9LIA6_ARATH_NONE ::Arabidopsis thaliana (Mouse-ear cress) ::none ::Transcription initiation factor IIB (TFIIB)-like protein.
MEEETCLDCKRPTIMVVDHSSGDTICSECGLVLEAHIIEYSQEWRTFASDDNHSDRDPNR
VGAATNPFLKSGDLVTIIEKPKETASSVLSKDDISTLFRAHNQVKNHEEDLIKQAFEEIQ
RMTDALDLDIVINSRACEIVSKYDGHANTKLRRGKKLNAICAASVSTACRELQLSRTLKE
IAEVANGVDKKDIRKESLVIKRVLESHQTSVSASQAIINTGELVRRFCSKLDISQREIMA
IPEAVEKAENFDIRRNPKSVLAAIIFMISHISQTNRKPIREIGIVAEVVENTIKNSVKDM
YPYALKIIPNWYACESDIIKRLDGVITSWDSAKFSV
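For downstream scripts it can be handy to split these header lines back into their parts. The sketch below assumes the layout seen in the two records above, that is, the name followed by species, gene name and description, each introduced by "::". The function name parse_header and the dictionary keys are illustrative choices, not part of BLAST2SRS.

```python
def parse_header(header):
    """Split a BLAST2SRS-style FASTA header into its fields.

    Assumed layout: >NAME ::species::gene::description
    Missing fields are returned as empty strings.
    """
    name, _, rest = header.lstrip(">").partition(" ")
    # The text after the name starts with '::', so the first piece is empty.
    pieces = [p.strip() for p in rest.split("::", 3)][1:]
    pieces += [""] * (3 - len(pieces))
    species, gene, description = pieces
    return {
        "name": name,
        "species": species,
        "gene": gene,
        "description": description,
    }


header = (">Q9LIA6_ARATH_NONE ::Arabidopsis thaliana (Mouse-ear cress) "
          "::none ::Transcription initiation factor IIB (TFIIB)-like protein.")
fields = parse_header(header)
print(fields["name"], "|", fields["species"], "|", fields["gene"])
```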

Type in SRS query box

Type in one or more keywords for selecting sequences. Multiple keywords must be separated by a logical operator. The operator choices are & (and), | (or) and ! (but not). To see how these work, we can explore the taxonomic range in which TFIIB sequences are found.

Try out these SRS keyword examples for the TFIIB query

archaea

archaea & pyrococcus 

archaea ! pyrococcus

bacteria              

metazoa 

metazoa | fungi

fungi & BRF1
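To make the operator semantics concrete, here is a toy evaluator for such keyword expressions. It is only a sketch under stated assumptions: real SRS queries run against indexed database fields, whereas this example checks a plain set of keywords per entry; the expression is evaluated strictly left to right, which may differ from SRS's own precedence rules; and the entries and their keyword sets are invented.

```python
import re


def matches(query, keywords):
    """Evaluate 'a & b', 'a | b' or 'a ! b' (a but not b) left to right."""
    have = {k.lower() for k in keywords}
    # Split on the operators while keeping them as separate tokens.
    tokens = [t.strip() for t in re.split(r"([&|!])", query)]
    result = tokens[0].lower() in have
    for op, word in zip(tokens[1::2], tokens[2::2]):
        present = word.lower() in have
        if op == "&":
            result = result and present
        elif op == "|":
            result = result or present
        else:  # '!' means "but not"
            result = result and not present
    return result


# Invented entries with invented keyword sets, for illustration only.
entries = {
    "ENTRY_A": {"archaea", "pyrococcus"},
    "ENTRY_B": {"archaea", "sulfolobus"},
    "ENTRY_C": {"eukaryota", "metazoa"},
    "ENTRY_D": {"eukaryota", "fungi", "BRF1"},
}


def search(query):
    return sorted(name for name, kw in entries.items() if matches(query, kw))


for q in ["archaea", "archaea & pyrococcus", "archaea ! pyrococcus",
          "bacteria", "metazoa | fungi", "fungi & BRF1"]:
    print(q, "->", search(q))
```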

Send query to SRS

Collects the sequences that match the SRS keyword(s) and are selected in the output. This takes you to a list page provided by SRS, where you can choose the sequence format and other options for retrieval. Most often you will want to use the Save option with the Complete Entries or FastaSeqs options. If you are not familiar with SRS, see the help on the main SRS page.
