Elementary Sequence Analysis


Database Searching

edited by B.Golding, Jan 1996


Are there homologues in the database?

The following are some of the common programs currently used to search the databases for sequences similar to a query sequence supplied by the user. Besides identifying an unknown sequence, they are useful for finding homologues and ancestral sequences that have similar or related functions or sequences. Each of them is available through web sites and through e-mail servers; that is, an e-mail message is received by a remote machine, interpreted, and submitted to a job queue, and the results are returned by mail. Because you are not communicating with a real person, you must follow a strict input format (otherwise the computer will mail you an error message).


FASTA

Searching the whole genetic sequence database can take a great deal of time because of its size, and it takes even longer if some operation must be performed on each sequence. One such operation is looking through the whole database for homologous or similar sequences. Special programs have been developed to speed up this kind of search. The first among these is FASTA, written by W.R. Pearson and D.J. Lipman (1988, PNAS 85:2444-2448).

It is possible to run this program on remote machines. The obvious choice for such a remote machine would be one that has access to the latest sequence information. Both EMBL and DDBJ have permitted this type of access and have implemented FASTA type searches through their machines (NCBI prefers to use BLAST - see below).

There are several flavours of FASTA:

  • fasta - scans a protein or DNA sequence library for sequences similar to a query sequence.
  • tfasta - compares a protein query sequence to the DNA sequence library, translating the DNA sequence on the fly.
  • lfasta - compares two query sequences for local similarity between them and shows the local sequence alignments.
  • plfasta - compares two sequences for local similarity and plots the local sequence alignments.

I will illustrate what a FASTA type of search is and what the results look like with an example. Basically the idea is to search through the complete database for any similar sequence.

Instructions

To carry out this type of search on the EMBL server the following must be done. Set up a file containing the following

LIB SWALL
WORD 1
LIST 50
TITLE HALHA
SEQ
PTVEYLNYETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWP
FSCRAGACANCASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAADEVKIVYNAKHL
DYLQNRVI
The first line contains the data library files to be searched (in this case all Swiss-Prot and NBRF/PIR entries). It may be EMALL (all EMBL entries plus those in the latest release), or GENEMBL (GenBank plus EMBL), or EPRI (EMBL primate entries), etc. The second line gives the word size or k-tuple value (more on this below). The third line says to LIST on the output the top 50 scores. The TITLE line is used for the subject of the mail message. Finally SEQ implies that everything below this line to the end of the message is part of the sequence. In this case the sequence is the protein sequence of the ferredoxin gene of Halobacterium halobium. The other options available for LIB are...

The remaining options are:

  • LIST n - list the n top scores in the output [50].
  • ALIGN n - align the top n sequences to the query sequence [10].
  • ONE - compare only the given strand to the database; the default is to use the complementary strand as well.
  • PROT - force the query sequence to be treated as a protein (short protein sequences may otherwise be misinterpreted as DNA).
  • PATH string - mail the results back to string rather than to the originator of the message.
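The query file described above is simple enough to assemble programmatically. The following is only a sketch of one way to do so: the function name build_fasta_query and its parameters are hypothetical, the keywords (LIB, WORD, LIST, TITLE, SEQ) and default values are taken from the example file, and the 60-residue line width simply mirrors that example rather than any documented requirement of the server.

```python
def build_fasta_query(sequence, lib="SWALL", word=1, top=50,
                      title="HALHA", width=60):
    """Return the text of a mail-server query laid out like the example file.

    Hypothetical helper: only the keywords shown in the example are emitted.
    """
    # Remove any whitespace so the sequence can be re-wrapped at a fixed width.
    residues = "".join(sequence.split()).upper()
    lines = [
        f"LIB {lib}",      # library to search
        f"WORD {word}",    # word size (k-tuple)
        f"LIST {top}",     # number of top scores to list
        f"TITLE {title}",  # subject of the returned mail message
        "SEQ",             # everything after this line is sequence
    ]
    lines += [residues[i:i + width] for i in range(0, len(residues), width)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_fasta_query("PTVEYLNYETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWP"))
```

The returned string would form the body of the mail message; sending it is left to whatever mail client is at hand.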

After creating this file, mail the file by electronic mail to

fasta@ebi.ac.uk

and the results will be sent back to you by electronic mail.

Alternatively point your web browser to FASTA.

PLEASE NOTE - as a courtesy to others using the system please send only one job at a time. Many other people from all over the world are using this server and the FASTA program is quite computer intensive despite its speed.


FASTA output

An example of the output is shown below. The input file specifies the Halobacterium halobium ferredoxin amino acid sequence as the query for a search of the SWISS-PROT database.


(Peptide) FASTA of: 260117af.Seq  from: 1 to: 128  February 3, 1996  19:43



 TO: SWALL:*  Sequences:     51,998  Symbols: 18,448,967  Word Size: 1


Score Init1 Initn
<  2     0     0:
   4     0     0:
   6     9     9:=====
   8     7     7:====
  10    83    83:==========================================
  12   160   160:==================================================
  14   191   191:==================================================
  16   362   362:==================================================
  18   768   768:==================================================
  20  1279  1279:==================================================
  22  2370  2370:==================================================
  24  3774  3774:==================================================
  26  5417  5417:==================================================
  28  6770  6770:==================================================
  30  6878  6878:==================================================
  32  6299  6244:==================================================
  34  5114  4762:==================================================
  36  3941  3487:==================================================
  38  2686  2350:==================================================
  40  2087  1747:==================================================
  42  1319  1105:==================================================
  44   897   798:==================================================
  46   567   722:==================================================
  48   312   556:==================================================
  50   248   486:==================================================
  52   121   375:==================================================
  54    93   329:===============================================+++
  56    60   230:==============================++++++++++++++++++++
  58    37   179:===================+++++++++++++++++++++++++++++++
  60    35   153:==================++++++++++++++++++++++++++++++++
  62    12    93:======+++++++++++++++++++++++++++++++++++++++++
  64    10    58:=====++++++++++++++++++++++++
  66     4    56:==++++++++++++++++++++++++++
  68     3    32:==++++++++++++++
  70     2    18:=++++++++
  72     5    23:===+++++++++
  74     0    15:++++++++
  76     0    13:+++++++
  78     1     8:=+++
  80     0     7:++++
> 80    77    84:=======================================+++
 mean initn score:  23.6 (3.43)
 mean init1 score:  23.6 (3.43)


The best scores are:                                        init1 initn opt..

Sw:Fer_Halha  P00216 halobacterium halobium. ferredoxin. ... 635   635   635
Sw:Fer_Halsp  P00217 halobacterium sp. ferredoxin. 11/88     571   571   571
Sw:Fer_Synp4  P15788 synechococcus sp. (strain pcc 7418) ... 182   182   210
Sw:Fer_Galsu  P00241 galdieria sulphuraria (cyanidium cal... 163   182   188
Sw:Fer1_Anava  P00254 anabaena variabilis, and anabaena s... 180   180   203
Sw:Fer2_Nosmu  P00249 nostoc muscorum. ferredoxin ii. 11/88  179   179   200
Sw:Fer1_Synp7  P06517 synechococcus sp. (strain pcc 7942)... 162   179   185
Sw:Fer1_Anasp  P06543 anabaena sp. (strain pcc 7120). fer... 176   176   205
Sw:Fer_Synsp  P00256 synechococcus sp. ferredoxin. 11/88     175   175   199
Sw:Fer_Synli  P00255 synechococcus lividus. ferredoxin. 1... 175   175   199
Sw:Fer2_Spiol  P00224 spinacia oleracea (spinach). ferred... 157   175   180
Sw:Fer_Marpo  P09735 marchantia polymorpha (liverwort). f... 158   174   186
Sw:Fer_Nosmu  P00253 nostoc muscorum. ferredoxin. 11/88      174   174   203
Sw:Fer_Gleja  P00233 gleichenia japonica (urajiro) (fern)... 158   174   180
Sw:Fer1_Cyapa  P17007 cyanophora paradoxa. ferredoxin i. ... 159   174   180
Sw:Fer_Rhopl  P07484 rhodymenia palmata (dulse). ferredox... 157   172   180
Sw:Fer_Chlfr  P00247 chlorogloeopsis fritschii. ferredoxi... 168   168   189
Sw:Fer_Scequ  P00238 scenedesmus quadricauda. ferredoxin.... 153   168   174
Sw:Fer1_Nosmu  P00252 nostoc muscorum. ferredoxin i. 11/88   166   166   191
Sw:Fer1_Orysa  P11051 oryza sativa (rice). ferredoxin i. ... 152   165   174
Sw:Fer_Eugvi  P22341 euglena viridis. ferredoxin. 8/91       164   164   190
Sw:Fer5_Maize  P27789 zea mays (maize). ferredoxin v prec... 142   163   173
Sw:Fer_Masla  P00248 mastigocladus laminosus (fischerella... 162   162   186
Sw:Fer_Chlre  P07839 chlamydomonas reinhardtii. ferredoxi... 147   162   168
Sw:Fer1_Equte  P00234 equisetum telmateia (giant horsetai... 160   160   188
Sw:Fer1_Maize  P27787 zea mays (maize). ferredoxin i prec... 145   160   170
Sw:Fer1_Equar  P00235 equisetum arvense (field horsetail)... 159   159   187
Sw:Fer_Bryma  P07838 bryopsis maxima. ferredoxin. 2/94       159   159   181
Sw:Fer_Wheat  P00228 triticum aestivum (wheat). ferredoxi... 156   156   177
Sw:Fer2_Dunsa  P00240 dunaliella salina. ferredoxin ii. 2/94 154   154   175
Sw:Fer1_Spiol  P00221 spinacia oleracea (spinach). ferred... 154   154   171
Sw:Fer1_Dunsa  P00239 dunaliella salina. ferredoxin i. 2/94  151   151   175
Sw:Fer_Porum  P00242 porphyra umbilicalis (laver). ferred... 151   151   179
Sw:Fer_Perbi  P10770 peridinium bipes (dinoflagellate). f... 137   151   159
Sw:Fer_Spipl  P00246 spirulina platensis. ferredoxin. 3/92   149   149   180
Sw:Fer1_Phyes  P00230 phytolacca esculenta (food pokeberr... 148   148   178
Sw:Fer_Brana  P00227 brassica napus (rape). ferredoxin. 2/94 136   148   161
Sw:Fer_Aphsa  P00250 aphanothece sacrum. ferredoxin i. 5/92  132   148   167
Sw:Fer_Leugl  P00225 leucaena glauca (white popinac) (leu... 147   147   170
Sw:Fer1_Phyam  P00229 phytolacca americana (common pokebe... 147   147   177
Sw:Fer_Bumfi  P13106 bumilleriopsis filiformis. ferredoxi... 132   147   167
Sw:Fer_Spima  P00245 spirulina maxima. ferredoxin. 11/88     147   147   178
Sw:Fer_Syny4  P00243 synechocystis sp. (strain pcc 6714).... 146   146   178
Sw:Fer2_Rapsa  P14937 raphanus sativus (radish). ferredox... 146   146   179
Sw:Ferh_Anava  P46046 anabaena variabilis. ferredoxin, he... 146   146   161
Sw:Fer_Syny3  P27320 synechocystis sp. (strain pcc 6803).... 146   146   179
Sw:Fer1_Rapsa  P14936 raphanus sativus (radish). ferredox... 145   145   181
Sw:Fer2_Plebo  P46035 plectonema boryanum. ferredoxin ii ... 144   144   161
Sw:Ferh_Anasp  P11053 anabaena sp. (strain pcc 7120). fer... 144   144   159
Swnew:Fer2_Plebo  P46035 FERREDOXIN II (FDII). 2/96          144   144   161
Sw:Fer_Coles  P00222 colocasia esculenta (elephant's ear)... 144   144   168
Sw:Fer_Arath  P16972 arabidopsis thaliana (mouse-ear cres... 144   144   166
Sw:Fer1_Synp2  P31965 synechococcus sp. (strain pcc 7002)... 144   144   173
Sw:Fer1_Aphfl  P00244 aphanizomenon flos-aquae. ferredoxi... 143   143   177
Sw:Fer3_Rapsa  P14938 raphanus sativus (radish). ferredox... 143   143   166
Sw:Fer_Silpr  P04669 silene pratensis (white campion) (ly... 143   143   166
Sw:Ferh_Fredi  P28610 fremyella diplosiphon (calothrix pc... 142   142   158
Sw:Fer_Samni  P00226 sambucus nigra (european elder). fer... 140   140   163
Sw:Fer_Medsa  P00220 medicago sativa (alfalfa). ferredoxi... 139   139   159
Sw:Fer2_Phyam  P00231 phytolacca americana (common pokebe... 139   139   164


260117af.Seq
Sw:Fer_Halha

ID   FER_HALHA      STANDARD;      PRT;   128 AA.
AC   P00216;
DT   21-JUL-1986 (REL. 01, CREATED)
DT   21-JUL-1986 (REL. 01, LAST SEQUENCE UPDATE)
DT   01-JUN-1994 (REL. 29, LAST ANNOTATION UPDATE)
DE   FERREDOXIN. . . . 

SCORES     Init1: 635 Initn: 635 Opt: 635
           100.0% identity in 128 aa overlap

               10        20        30        40        50        60
260117 PTVEYLNYETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWP
       ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Fer_Ha PTVEYLNYETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWP
               10        20        30        40        50        60

               70        80        90       100       110       120
260117 FSCRAGACANCASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAADEVKIVYNAKHL
       ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Fer_Ha FSCRAGACANCASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAADEVKIVYNAKHL
               70        80        90       100       110       120

               
260117 DYLQNRVI
       ||||||||
Fer_Ha DYLQNRVI
               


260117af.Seq
Sw:Fer_Halsp

ID   FER_HALSP      STANDARD;      PRT;   128 AA.
AC   P00217;
DT   21-JUL-1986 (REL. 01, CREATED)
DT   21-JUL-1986 (REL. 01, LAST SEQUENCE UPDATE)
DT   01-NOV-1988 (REL. 09, LAST ANNOTATION UPDATE)
DE   FERREDOXIN. . . . 

SCORES     Init1: 571 Initn: 571 Opt: 571
            84.4% identity in 128 aa overlap

               10        20        30        40        50        60
260117 PTVEYLNYETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWP
       |||||||||::||:|||| |||:|::|:| :||:||||::||:|||||||||||||||||
Fer_Ha PTVEYLNYEVVDDNGWDMYDDDVFGEASDMDLDDEDYGSLEVNEGEYILEAAEAQGYDWP
               10        20        30        40        50        60

               70        80        90       100       110       120
260117 FSCRAGACANCASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAADEVKIVYNAKHL
       ||||||||||||:|| ||:|||||||||||||||:|:|||||||||:|||||||||||||
Fer_Ha FSCRAGACANCAAIVLEGDIDMDMQQILSDEEVEDKNVRLTCIGSPDADEVKIVYNAKHL
               70        80        90       100       110       120

               
260117 DYLQNRVI
       ||||||||
Fer_Ha DYLQNRVI
               


260117af.Seq
Sw:Fer_Synp4

ID   FER_SYNP4      STANDARD;      PRT;    98 AA.
AC   P15788;
DT   01-APR-1990 (REL. 14, CREATED)
DT   01-APR-1990 (REL. 14, LAST SEQUENCE UPDATE)
DT   01-NOV-1990 (REL. 16, LAST ANNOTATION UPDATE)
DE   FERREDOXIN. . . . 

SCORES     Init1: 182 Initn: 182 Opt: 210
            48.8% identity in 82 aa overlap

       10        20        30        40        50        60        
260117 ETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWPFSCRAGAC
                                     |:||:::||||::||::| | |:|||||||
Fer_Sy               ASYKVTLINEEMGLNETIEVPDDEYILDVAEEEGIDLPYSCRAGAC
                             10        20        30        40      

       70        80        90       100       110       120        
260117 ANCASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAADEVKIVYNAKHLDYLQNRVI
       ::||: :|||||| : | :|:|:::|:  | |||:: ||:| : |::::::|        
Fer_Sy STCAGKIKEGEIDQSDQSFLDDDQIEAGYV-LTCVAYPASDCTIITHQEEELY       
         50        60        70         80        90               


260117af.Seq
Sw:Fer_Galsu

ID   FER_GALSU      STANDARD;      PRT;    98 AA.
AC   P00241;
DT   21-JUL-1986 (REL. 01, CREATED)
DT   21-JUL-1986 (REL. 01, LAST SEQUENCE UPDATE)
DT   01-OCT-1994 (REL. 30, LAST ANNOTATION UPDATE)
DE   FERREDOXIN. . . . 

SCORES     Init1: 163 Initn: 182 Opt: 188
            41.5% identity in 82 aa overlap

       10        20        30        40        50        60        
260117 ETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWPFSCRAGAC
                                     |:| ::::|||:|||:|| | |:|||||||
Fer_Ga               ASYKIHLVNKDQGIDETIECPDDQYILDAAEEQGLDLPYSCRAGAC
                             10        20        30        40      

       70        80        90       100       110       120        
260117 ANCASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAADEVKIVYNAKHLDYLQNRVI
       ::||: : |||:| : | :|:|::|:: :  |||:: |::::: :::::: |        
Fer_Ga STCAGKLLEGEVDQSDQSFLDDDQVKA-GFVLTCVAYPTSNATILTHQEESLY       
         50        60        70         80        90               


260117af.Seq
Sw:Fer1_Anava

ID   FER1_ANAVA     STANDARD;      PRT;    98 AA.
AC   P00254;
DT   21-JUL-1986 (REL. 01, CREATED)
DT   01-NOV-1988 (REL. 09, LAST SEQUENCE UPDATE)
DT   01-FEB-1994 (REL. 28, LAST ANNOTATION UPDATE)
DE   FERREDOXIN I. . . . 

SCORES     Init1: 180 Initn: 180 Opt: 203
            45.1% identity in 82 aa overlap

       10        20        30        40        50        60        
260117 ETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWPFSCRAGAC
                                     |::|:::||||:|||:|||| |||||||||
Fer1_A               ATFKVTLINEAEGTSNTIDVPDDEYILDAAEEQGYDLPFSCRAGAC
                             10        20        30        40      

       70        80        90       100       110       120        
260117 ANCASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAADEVKIVYNAKHLDYLQNRVI
       ::||: : :|::| : | :|:|:::|:  | |||:: |::| :  ::::::|        
Fer1_A STCAGKLVSGTVDQSDQSFLDDDQIEAGYV-LTCVAYPTSDVTIQTHKEEDLY       
         50        60        70         80        90               


260117af.Seq
Sw:Fer2_Nosmu

ID   FER2_NOSMU     STANDARD;      PRT;    98 AA.
AC   P00249;
DT   21-JUL-1986 (REL. 01, CREATED)
DT   21-JUL-1986 (REL. 01, LAST SEQUENCE UPDATE)
DT   01-NOV-1988 (REL. 09, LAST ANNOTATION UPDATE)
DE   FERREDOXIN II. . . . 

SCORES     Init1: 179 Initn: 179 Opt: 200
            45.1% identity in 71 aa overlap

       10        20        30        40        50        60        
260117 ETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWPFSCRAGAC
                                     |:||:::||||:|||::| | |||||:|:|
Fer2_N               ATYKVRLFNAAEGLDETIEVPDDEYILDAAEEAGLDLPFSCRSGSC
                             10        20        30        40      

       70        80        90       100       110       120        
260117 ANCASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAADEVKIVYNAKHLDYLQNRVI
       ::|::|:|:|::| : |::|:|::::: :| |||:: |:::                   
Fer2_N SSCNGILKKGTVDQSDQNFLDDDQIAAGNV-LTCVAYPTSNCEIETHREDAIA       
         50        60        70         80        90               


260117af.Seq
Sw:Fer1_Synp7

ID   FER1_SYNP7     STANDARD;      PRT;    98 AA.
AC   P06517;
DT   01-JAN-1988 (REL. 06, CREATED)
DT   01-JAN-1988 (REL. 06, LAST SEQUENCE UPDATE)
DT   01-AUG-1990 (REL. 15, LAST ANNOTATION UPDATE)
DE   FERREDOXIN I. . . . 

SCORES     Init1: 162 Initn: 179 Opt: 185
            41.5% identity in 82 aa overlap

       10        20        30        40        50        60        
260117 ETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWPFSCRAGAC
                                     |::||:::|||:|||:|| | |:|||||||
Fer1_S               ATYKVTLVNAAEGLNTTIDVADDTYILDAAEEQGIDLPYSCRAGAC
                             10        20        30        40      

       70        80        90       100       110       120        
260117 ANCASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAADEVKIVYNAKHLDYLQNRVI
       ::||: | :|::| : | :|:|::::: :  |||:: |::| :  ::::::|        
Fer1_S STCAGKVVSGTVDQSDQSFLDDDQIAA-GFVLTCVAYPTSDVTIETHKEEDLY       
         50        60        70         80        90               



......................................................
..................Material deleted....................
......................................................



260117af.Seq
Sw:Fer_Synsp

ID   FER_SYNSP      STANDARD;      PRT;    97 AA.
AC   P00256;
DT   21-JUL-1986 (REL. 01, CREATED)
DT   21-JUL-1986 (REL. 01, LAST SEQUENCE UPDATE)
DT   01-NOV-1988 (REL. 09, LAST ANNOTATION UPDATE)
DE   FERREDOXIN. . . . 

SCORES     Init1: 175 Initn: 175 Opt: 199
            48.8% identity in 84 aa overlap

       10        20        30        40        50        60        
260117 ETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWPFSCRAGAC
                                     |::|:|:||||::||:|| | |||||||||
Fer_Sy                ATYKVTLVRPDGSETTIDVPEDEYILDVAEEQGLDLPFSCRAGAC
                              10        20        30        40     

       70        80        90       100       110       120        
260117 ANCASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAADEVKIVYNAKHLDYLQNRVI
       ::||: : |||:| : | :|:|::: ||:  |||:: | :|  ||: |:::  |      
Fer_Sy STCAGKLLEGEVDQSDQSFLDDDQI-EKGFVLTCVAYPRSD-CKILTNQEEELY      
          50        60        70         80         90             


260117af.Seq
Sw:Fer_Synli

ID   FER_SYNLI      STANDARD;      PRT;    96 AA.
AC   P00255;
DT   21-JUL-1986 (REL. 01, CREATED)
DT   21-JUL-1986 (REL. 01, LAST SEQUENCE UPDATE)
DT   01-NOV-1988 (REL. 09, LAST ANNOTATION UPDATE)
DE   FERREDOXIN. . . . 

SCORES     Init1: 175 Initn: 175 Opt: 199
            46.3% identity in 82 aa overlap

       10        20        30        40        50        60        
260117 ETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWPFSCRAGAC
                                     |::|:|:||||::||:|| | |||||||||
Fer_Sy                 ATYKVTLVRPDGETTIDVPEDEYILDVAEEQGLDLPFSCRAGAC
                               10        20        30        40    

       70        80        90       100       110       120        
260117 ANCASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAADEVKIVYNAKHLDYLQNRVI
       ::||: : |||:| : | :|:|::: ||:  |||:: | :|   :::::::|        
Fer_Sy STCAGKLLEGEVDQSDQSFLDDDQI-EKGFVLTCVAYPRSDCKILTHQEEELY       
           50        60         70        80        90             


 CPU time:  0:01:02
 Output File: Local$Scratch:260117af.Res;



FASTA format

The output from the FASTA search begins with some informational messages. These include the number of sequences and the total number of amino acids searched (note the size of the latter number).

Next comes a histogram (lying on its side) of the number of sequences found with various scores (init1 and initn scores). Each symbol in this graph stands for two sequences (the = symbol indicates both init1 and initn scores, the - symbol indicates just init1 and the + symbol indicates just initn scores). This histogram gives you an indication of how similar the query sequence is to some of the database sequences. For a query sequence that has found a significant match, the score of that match should lie well out in the upper tail of the init1 and initn distributions. In the above example there are many sequences with init1/initn scores much higher than the scores obtained from the rest of the sequences, and you will find that these are all related ferredoxin sequences from other species. Their presence at this extreme end of the distribution indicates that the similarity between these sequences and the query sequence is much greater, and more significant, than the similarity between the query sequence and sequences in the database in general.
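The way the histogram rows are drawn can be mimicked in a few lines. This is a sketch inferred from the example rows above, not taken from the FASTA source code: it assumes that each count is divided by two and rounded up, and that a row is cut off at 50 symbols. The function name histogram_row is hypothetical.

```python
def histogram_row(init1, initn, per_symbol=2, width=50):
    """Draw one histogram row from the init1 and initn counts for a score bin.

    Assumptions (inferred from the example output): counts are divided by
    per_symbol and rounded up, and rows are truncated at `width` symbols.
    """
    n1 = min(-(-init1 // per_symbol), width)   # ceiling division, then cap
    nn = min(-(-initn // per_symbol), width)
    shared = min(n1, nn)
    # '=' where both distributions reach, '-' for init1 only, '+' for initn only.
    return "=" * shared + "-" * (n1 - shared) + "+" * (nn - shared)


if __name__ == "__main__":
    for score, init1, initn in [(70, 2, 18), (72, 5, 23), (78, 1, 8), (80, 0, 7)]:
        print(f"{score:5d} {init1:5d} {initn:5d}:{histogram_row(init1, initn)}")
```

Run on the four bins shown, this reproduces the corresponding rows of the example histogram.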

Next comes a section that lists the sequences (along with their locus names) that have the best scores. Finally there is a section that lists the alignments that have been found by the program.

In comparing a query sequence to the database two scores are calculated for each and every entry in the database. These scores are initn and init1. A third score, opt, will be calculated for some of the top scoring entries.

The first step is to build a lookup table of the words (short subsequences such as ATCG or MKR) that occur in the sequence. The length of these words is set by the WORD or k-tuple value. By default it is 6 for nucleic acids and 2 for amino acid searches. A lower k-tuple gives a more sensitive search but takes much longer. Although a range of 3 to 6 is permitted for nucleic acids, a lower value is generally unnecessary. All positions are then located where a k-tuple from the query and a k-tuple from the database sequence agree perfectly, and the regions with the highest density of these identities are found.
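The word-matching step can be illustrated with a small sketch. It is not the FASTA implementation: it assumes that a "region" can be approximated by a diagonal (a fixed offset between query and library positions) and simply counts exact k-tuple matches per diagonal. The function names ktuple_hits and densest_diagonal are hypothetical.

```python
from collections import defaultdict


def ktuple_hits(query, target, k=2):
    """Return {diagonal: [query positions]} for every exact k-tuple match.

    The diagonal is (query position - target position); matches that share a
    diagonal line up without gaps.
    """
    # Index every word of length k in the query by its starting position.
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)

    # Scan the target once and look each of its words up in the index.
    diagonals = defaultdict(list)
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], ()):
            diagonals[i - j].append(i)
    return dict(diagonals)


def densest_diagonal(hits):
    """Return the diagonal carrying the most k-tuple matches (None if no hits)."""
    return max(hits, key=lambda d: len(hits[d])) if hits else None


if __name__ == "__main__":
    hits = ktuple_hits("AAPFSCRAGAC", "FSCRAGAC", k=2)
    print(densest_diagonal(hits), hits)
```

With k=1 every shared residue is a hit, which is why a smaller word size is more sensitive but slower: far more positions have to be examined.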

An init1 score is assigned to each of these regions of high similarity after the regions are extended at the ends to include regions shorter than the length of a k-tuple and after using a PAM250 matrix (alternative distance matrices are available - more on these later) to score mismatches.

An attempt is then made to join nearby regions together, and an initn score is generated from the result. This is done by setting initn equal to the sum of the init1 scores of the two regions (the final init1 score of a sequence is the maximum init1 score from all of its regions) and then subtracting a constant of 20 as a joining penalty. If the resulting initn score is less than one of the init1 scores, it is discarded: the regions are not joined, and initn is set equal to the maximum init1 score (hence initn is always greater than or equal to init1).
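The joining rule is simple arithmetic and can be written out as a sketch for the two-region case described here. The function name initn_score is hypothetical, and the second region score of 39 in the usage example is an invented value chosen only because it is consistent with the Fer_Galsu line of the score table (init1 163, initn 182); the actual region scores are not given in the output.

```python
JOIN_PENALTY = 20  # constant joining penalty quoted in the text


def initn_score(region_a, region_b, penalty=JOIN_PENALTY):
    """Combine the scores of two regions into an initn score.

    The regions are joined only if the joined score is at least as large as
    the best single region; otherwise initn falls back to that best region.
    """
    init1 = max(region_a, region_b)          # best single region
    joined = region_a + region_b - penalty   # sum minus the joining penalty
    return joined if joined >= init1 else init1


if __name__ == "__main__":
    print(initn_score(163, 39))  # 163 + 39 - 20 = 182, so the regions are joined
    print(initn_score(163, 15))  # 163 + 15 - 20 = 158 < 163, so initn stays 163
```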

Sequences that have an initn score larger than a cutoff value (usually 50 but this can be altered with a "LIST n" command in the query file) are then used for a Smith-Waterman alignment (see the section on alignments) and an OPT score is generated from these alignments. Only the region considered significant by the program is displayed. Each alignment is presented with the name of the sequence, the scores, and the percent identity over the region aligned. In general the length of the region aligned is a better indicator of homology than is the percent identity. This is because large percentages can be found in short regions just by chance. A "|" is used to indicate a complete match, a ":" to indicate a conservative amino acid replacement, and a "-" to indicate a deletion/insertion.
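For readers who have not yet reached the section on alignments, the core of a Smith-Waterman local alignment score is sketched here. This is deliberately simplified and is not how FASTA computes opt: it uses a flat match/mismatch score and a linear gap penalty (all three values are arbitrary assumptions) instead of the PAM250 matrix, and it returns only the best score, not the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Simplifying assumptions: flat match/mismatch scores and a linear gap
    penalty; a real search would use a substitution matrix such as PAM250.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never go below 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("DEEYILEAAE", "DDEYILDAAE"))
```

Because the score is floored at zero, poorly matching flanks do not drag down a well-conserved core, which is what makes the alignment local.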

Note that the opt score can be lower than the initn score. This will happen when one sequence has two (or more) regions of high similarity separated by regions that have little/no homology. The two regions are joined with high init1 scores and the initn score is high because the gap penalty/join penalty is not sufficiently large. In contrast sequences with a large number of poorly similar regions will have low init1 scores but high initn scores and then low opt scores. In general, unless a very short sequence is used, the init1 score should be much improved by the opt score for truly significant sequences.

Remember to remove repetitive sequences from your query otherwise you will get a lot of false hits. The FASTA program itself can be obtained via anonymous FTP if desired.

Statistical Significance

With version 2.0 of the FASTA program distribution, FASTA, TFASTA, and SSEARCH now provide estimates of statistical significance for library searches. Work by Altschul, Arratia, Karlin, Mott, Waterman, and others (see Altschul et al. (1994) Nature Genetics 6:119 for an excellent review) suggests that local sequence similarity scores follow the extreme value distribution, so that P(s > x) = 1 - exp(-exp(-lambda(x-u))) where u = ln(Kmn)/lambda and m,n are the lengths of the query and library sequence. This formula can be rewritten as: 1 - exp(-Kmn exp(-lambda x)), which shows that the average score for an unrelated library sequence increases with the logarithm of the length of the library sequence. FASTA and SSEARCH use simple linear regression against the log of the library sequence length to calculate a normalized "z-score" with mean 50, regardless of library sequence length, and variance 10. These z-scores can then be used with the extreme value distribution and the Poisson distribution (to account for the fact that each library sequence comparison is an independent test) to calculate the number of library sequences expected to obtain a score greater than or equal to the score obtained in the search. The original idea and routines to do the linear regression on library sequence length were provided by Phil Green, U. Washington. This version of FASTA and SSEARCH uses a slightly different strategy for fitting the data than those originally provided by Dr. Green.
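The two forms of the extreme value formula can be checked numerically. The sketch below is only an illustration: the values of K and lambda are arbitrary placeholders (in practice they depend on the scoring system), the function names are hypothetical, and the expected count is taken to be the library size multiplied by the single-comparison probability, which is an assumption about how the figure is derived rather than a description of the FASTA code.

```python
import math


def evd_pvalue(x, m, n, K, lam):
    """P(s > x) written as 1 - exp(-K*m*n*exp(-lambda*x))."""
    # -expm1(-y) equals 1 - exp(-y) but stays accurate when y is tiny.
    return -math.expm1(-K * m * n * math.exp(-lam * x))


def evd_pvalue_location_form(x, m, n, K, lam):
    """The same probability written with the location parameter u = ln(Kmn)/lambda."""
    u = math.log(K * m * n) / lam
    return -math.expm1(-math.exp(-lam * (x - u)))


def expected_hits(p, library_size):
    """Expected number of library sequences scoring >= x by chance (assumed D * p)."""
    return library_size * p


if __name__ == "__main__":
    K, lam = 0.1, 0.3   # placeholder values, not taken from any real search
    m, n = 128, 98      # query and library sequence lengths from the example
    p = evd_pvalue(60, m, n, K, lam)
    print(p, evd_pvalue_location_form(60, m, n, K, lam))
    print(expected_hits(p, 51998))
```

Doubling n doubles the product Kmn and so shifts u by ln(2)/lambda, which is the logarithmic dependence on library sequence length that the regression corrects for.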

The expected number of sequences is plotted in the histogram using an "*". Since the parameters for the extreme value distribution are not calculated directly from the distribution of similarity scores, the pattern of "*'s" in the histogram gives a qualitative view of how well the statistical theory fits the similarity scores calculated by FASTA and SSEARCH. For FASTA, if optimized scores are calculated for each sequence in the database (-o option), the agreement between the actual distribution of "z-scores" and the expected distribution based on the length dependence of the score and the extreme value distribution is usually very good. Likewise, the distribution of SSEARCH Smith-Waterman scores typically agrees closely with the actual distribution of "z-scores." The agreement with unoptimized scores, ktup=2, is often not very good, with too many high scoring sequences and too few low scoring sequences compared with the predicted relationship between sequence length and similarity score. In those cases, the expectation values may be overestimates.

The statistical routines assume that the library contains a large sample of unrelated sequences. If this is not the case, then the expectation values are meaningless. Likewise, if there are fewer than 20 sequences in the library, the statistical calculations are not done.

A complete manual for FASTA can be consulted for further questions.


BLAST

While FASTA is a sensitive and rapid algorithm for finding similar sequences in the database, it is not without problems (one of these being the snail-like pace of the ethernet connection to the FASTA server at EMBL). Because its initial step looks for perfect matches, it will completely ignore more distantly related sequences that have functional homology but no longer retain complete identity. If a region of an amino acid sequence has undergone many conservative replacements and no longer has any identities, the FASTA algorithm will completely miss this region. Fortunately, alignments with extensive regions of low but not exact similarity are rare enough that a small WORD or k-tuple size will pick up most regions.

A different algorithm which improves upon FASTA in speed, if not in sensitivity, is termed BLAST (Basic Local Alignment Search Tool). This began with a statistical paper by Karlin and Altschul (PNAS 87:2264-2268, 1990) who developed a rigorous method to obtain the probabilities of matches with a query sequence given that no gaps are permitted. This permits the use of larger WORD or k-tuple sizes with the concomitant increase in speed but permitting inexact matches between WORDs. The statistical developments permit this to be done without loss of sensitivity and permit rigorous statistical statements to be made about the matches found.
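The idea of allowing inexact matches between words can be sketched as follows: for each word in the query, list every word of the same length whose score against it reaches some threshold, and then look for any of those words in the database. Everything specific below is an assumption made for illustration: the scoring scheme (4 for identity, 1 within a toy "conservative" group, -2 otherwise), the groups themselves, the threshold values, and the function names. A real search would use a substitution matrix such as PAM or BLOSUM.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Toy groups of similar residues, for illustration only.
GROUPS = ["ILMV", "FWY", "KRH", "DE", "NQ", "ST", "AG"]


def pair_score(a, b):
    """Toy substitution score: identity 4, same group 1, anything else -2."""
    if a == b:
        return 4
    if any(a in g and b in g for g in GROUPS):
        return 1
    return -2


def neighbourhood(word, threshold):
    """List all words of the same length scoring >= threshold against `word`."""
    return [
        "".join(candidate)
        for candidate in product(AMINO_ACIDS, repeat=len(word))
        if sum(pair_score(a, b) for a, b in zip(word, candidate)) >= threshold
    ]


if __name__ == "__main__":
    print(neighbourhood("DEV", 12))  # only the exact word reaches the top score
    print(neighbourhood("DEV", 9))   # one conservative replacement is tolerated
```

Lowering the threshold enlarges the list of acceptable words, trading speed for sensitivity in much the same way that a smaller k-tuple does in FASTA.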

As a result of these developments Altschul, Gish, Miller, Myers and Lipman (J.Mol.Biol. 215:403-410, 1990) created the BLAST group of programs. These algorithms find ungapped, locally optimal sequence alignments. There are several versions of the BLAST programs:

  • BLASTN - nucleotide query of the nucleotide database.
  • BLASTP - protein query of the protein database.
  • BLASTX - translate DNA to protein and query protein database.
  • TBLASTN - protein query of the translated nucleotide database.
  • TBLASTX - translate DNA to protein and query the translated nucleotide database.
  • BLAST3 - finds significant three-way alignments in which the pairwise alignments are insignificant.

To carry out this type of search on the NCBI server set up a file with the following

PROGRAM blastp
DATALIB nr
BEGIN
> Hal ha.
PTVEYLNYETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWP
FSCRAGACANCASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAADEVKIVYNAKHL
DYLQNRVI
and mail the file to blast@ncbi.nlm.nih.gov. You should receive the results by mail within a few minutes.

Only the input options shown above are mandatory; all others are optional. The input options are:

  • PROGRAM - blastn, blastp, blastx, tblastx, blast3, or tblastn.
  • DATALIB - has several options, like the LIB option of a FASTA search. The recommended option is "nr", which stands for non-redundant (it includes sequences from PDB, GenBank, GenBank updates, EMBL and EMBL updates for nucleotide searches, or sequences from PDB, SWISS-PROT, PIR, GenPept and GenPept updates for protein searches), but there are many others.
  • HISTOGRAM [yes]/no - turns on/off the printing of a histogram of the scores.
  • DESCRIPTIONS n - number of described matching sequences [100].
  • ALIGNMENTS n - number of high scoring pairs [50].
  • EXPECT n - the score such that n sequences should be found by chance alone [10] (a fractional value of one or less will give only output which is statistically unusual; larger values give more output).
  • CUTOFF n - cutoff score for segment pairs; high scoring pairs are reported in the output only if one of them scores at least as high as the cutoff [calculated from and dependent on the value of EXPECT].
  • MATRIX - determines the scoring matrix for protein comparisons [BLOSUM62].
  • STRAND - restricts a search to the top or bottom strand (top, bottom, [both]).
  • FILTER - masks parts of your query so that things like repetitive elements are ignored (FILTER seg will exclude regions of low compositional complexity, FILTER xnu will exclude regions with short-periodicity internal repeats, [exclude nothing]).

The other options available are QOFFSET, GCODE, PATH, SPLIT, and ACKNOWLEDGE.

More information about the programs and their output can be obtained from their man(ual) pages. More information about the algorithm is also available. BLASTN and BLASTP run on faster queues than do BLASTX, TBLASTX, BLAST3 and TBLASTN. The program actually scans entries twice: once to find the highest scoring pairs, and then again with a lower CUTOFF to find potential combinations of high scores that together might do better than a single high score.

The BLAST programs themselves can be obtained if desired by anonymous FTP to ncbi.nlm.nih.gov.


BLAST output

After some informational material comes ...



BLASTP 1.4.8MP [20-June-1995] [Build 13:58:02 Oct 17 1995]

Reference:  Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers,
and David J. Lipman (1990).  Basic local alignment search tool.  J. Mol. Biol.
215:403-10.

Query=  Hal ha.
        (128 letters)

Database:  Non-redundant PDB+SwissProt+SPupdate+PIR+GenPept+GPupdate
           173,745 sequences; 51,502,515 total letters.
Searching..................................................done


     Observed Numbers of Database Sequences Satisfying
    Various EXPECTation Thresholds (E parameter values)

        Histogram units:      = 28 Sequences     : less than 28 sequences

 EXPECTation Threshold
 (E parameter)
    |
    V   Observed Counts-->
  10000 6826 1707 |============================================================
   6310 5119 1608 |=========================================================
   3980 3511  999 |===================================
   2510 2512  699 |========================
   1580 1813  648 |=======================
   1000 1165  328 |===========
    631  837  237 |========
    398  600  157 |=====
    251  443   87 |===
    158  356   71 |==
    100  285   38 |=
   63.1  247   50 |=
   39.8  197   26 |:
   25.1  171   22 |:
   15.8  149    7 |:
 >>>>>>>>>>>>>>>>>>>>>  Expect = 10.0, Observed = 142  <<<<<<<<<<<<<<<<<
   10.0  142    8 |:
   6.31  134    5 |:
   3.98  129    5 |:
   2.51  124    3 |:
   1.58  121    1 |:
   1.00  120    1 |:
   0.63  119    1 |:
   0.40  118    1 |:
   0.25  117    0 |
   0.16  117    1 |:
   0.10  116    2 |:
  0.063  114    2 |:
  0.040  112    2 |:
  0.025  110    0 |
  0.016  110    1 |:
  0.010  109    2 |:
 0.0063  107    0 |
 0.0040  107    0 |
 0.0025  107    0 |
 0.0016  107    1 |:


                                                                     Smallest
                                                                       Sum
                                                              High  Probability
Sequences producing High-scoring Segment Pairs:              Score  P(N)      N

pir|S35235|S35235    ferredoxin [2Fe-2S] - Halobacterium ...   681  3.9e-89   1
sp|P00216|FER_HALHA  FERREDOXIN. >pir|A00220|FEHS ferredo...   681  3.9e-89   1
sp|P00217|FER_HALSP  FERREDOXIN. >pir|A00221|FEHSX ferred...   583  1.1e-75   1
sp|P15788|FER_SYNP4  FERREDOXIN. >pir|A28858|A28858 ferre...   176  1.7e-25   2
sp|P00241|FER_GALSU  FERREDOXIN. >pir|A00245|FEKK ferredo...   162  5.7e-22   2
sp|P00242|FER_PORUM  FERREDOXIN. >pir|A00246|FEPRU ferred...   150  2.7e-21   2
sp|P00234|FER1_EQUTE FERREDOXIN I. >pir|A00240|FEEQ1 ferr...   153  2.8e-21   2
pir|S08122|S08122    ferredoxin [2Fe-2S] I - Synechococcu...   156  3.6e-21   2
pir|S11048|FEKT1     ferredoxin [2Fe-2S] I - Cyanophora p...   155  3.6e-21   2
sp|P06517|FER1_SYNP7 FERREDOXIN I. >pir|A30022|A30022 fer...   156  3.7e-21   2
sp|P17007|FER1_CYAPA FERREDOXIN I.                             155  3.7e-21   2
pir|S28198|S28198    ferredoxin [2Fe-2S] A - giant taro        146  3.7e-21   2
pdb|1FRR|A           Ferredoxin I >pdb|1FRR|B Ferredoxin ...   151  5.2e-21   2
sp|P00224|FER2_SPIOL FERREDOXIN II. >pir|A00231|FESP2 fer...   154  6.9e-21   2
sp|P09735|FER_MARPO  FERREDOXIN. >pir|A24126|FELV ferredo...   149  7.1e-21   2
sp|P15789|FER_CYACA  FERREDOXIN.                               141  1.9e-19   2
sp|P00256|FER_SYNSP  FERREDOXIN. >pir|A00259|FEYCT ferred...   173  2.6e-19   1
sp|P00255|FER_SYNLI  FERREDOXIN. >pir|A00258|FEYCAL ferre...   173  2.7e-19   1
pir|A25761|FEAI      ferredoxin [2Fe-2S] - Anabaena varia...   170  6.7e-19   1
sp|P00254|FER1_ANAVA FERREDOXIN I.                             170  6.7e-19   1
sp|P00240|FER2_DUNSA FERREDOXIN II. >pir|A00244|FEDH2 fer...   148  9.7e-19   2
pir|S25233|S25233    ferredoxin [2Fe-2S] I - Anabaena sp....   168  1.3e-18   1
pdb|1FXA|A           [2Fe-2S] Ferredoxin >pdb|1FXA|B [2Fe...   168  1.3e-18   1
sp|P00253|FER_NOSMU  FERREDOXIN. >pir|A00257|FENM ferredo...   165  3.3e-18   1
sp|P14936|FER1_RAPSA FERREDOXIN, ROOT R-B1. >pir|JX0084|J...   139  3.4e-18   2
sp|P00238|FER_SCEQU  FERREDOXIN. >pir|A00242|FESC ferredo...   144  3.5e-18   2
sp|P00239|FER1_DUNSA FERREDOXIN I. >pir|A00243|FEDH1 ferr...   144  6.2e-18   2
sp|P22341|FER_EUGVI  FERREDOXIN. >pir|S15425|S15425 ferre...   163  6.3e-18   1
sp|P14937|FER2_RAPSA FERREDOXIN, ROOT R-B2. >pir|JX0083|J...   138  1.0e-17   2
gp|U33848|PBU33848_1 PetF1 [Plectonema boryanum]               161  1.1e-17   1
pir|JA0098|JA0098    ferredoxin [2Fe-2S] - Synechococcus sp.   160  2.8e-17   1
sp|P00249|FER2_NOSMU FERREDOXIN II. >pir|A00253|FENM2M fe...   159  5.6e-17   1
pir|S28199|S28199    ferredoxin [2Fe-2S] B - giant taro        136  7.4e-17   2
sp|P00247|FER_CHLFR  FERREDOXIN. >pir|A00251|FEEF ferredo...   158  1.0e-16   1
pir|S00361|FEKM      ferredoxin [2Fe-2S] - Chlamydomonas ...   136  1.9e-16   2
sp|P00252|FER1_NOSMU FERREDOXIN I. >pir|A00256|FENM1M fer...   156  2.8e-16   1
pir|S49989|S49989    2Fe-2S-ferredoxin - Anabaena variabi...   133  2.8e-16   2
sp|P00232|FER2_PHYES FERREDOXIN II. >pir|A00238|FEFW2E fe...   126  2.8e-16   2
sp|P46046|FERH_ANAVA FERREDOXIN, HETEROCYST.                   133  2.8e-16   2
sp|P00222|FER_COLES  FERREDOXIN. >pir|A00229|FETA ferredo...   133  6.7e-16   2
pir|S04543|S04543    ferredoxin [2Fe-2S] - Anabaena sp. (...   130  9.9e-16   2
sp|P00231|FER2_PHYAM FERREDOXIN II. >pir|A00237|FEFW2 fer...   127  1.0e-15   2
pdb|1FRD|            Heterocyst [2fe-2s] Ferredoxin (Oxid...   130  1.0e-15   2
sp|P00248|FER_MASLA  FERREDOXIN. >pir|A00252|FEMW ferredo...   152  1.6e-15   1
sp|P07839|FER_CHLRE  FERREDOXIN PRECURSOR. >gp|L10349|CRE...   136  9.1e-15   2
pdb|3FXC|            Ferredoxin >sp|P00246|FER_SPIPL FERR...   147  1.2e-14   1
sp|P00245|FER_SPIMA  FERREDOXIN. >pir|A00249|FESG ferredo...   146  1.7e-14   1
sp|P07484|FER_RHOPL  FERREDOXIN. >pir|A93760|FEPRR ferred...   146  1.7e-14   1
gp|D64000|SYCSLRB_86 ferredoxin [Synechocystis sp.]            146  1.7e-14   1
sp|P27320|FER_SYNY3  FERREDOXIN.                               146  1.7e-14   1
sp|P00243|FER_SYNY4  FERREDOXIN. >pir|A00247|FEYB6 ferred...   146  1.7e-14   1
sp|P27789|FER5_MAIZE FERREDOXIN V PRECURSOR. >gp|M73828|M...   135  2.1e-14   2
gp|D30794|RICFERRA_1 ferredoxin [Oryza sativa]                 135  2.2e-14   2
sp|P09911|FER1_PEA   FERREDOXIN I PRECURSOR. >pir|S11495|...   130  2.3e-14   2
sp|P11051|FER1_ORYSA FERREDOXIN I.                             144  3.7e-14   1
sp|P00230|FER1_PHYES FERREDOXIN I.                             144  3.7e-14   1
pir|S03730|FERZ      ferredoxin [2Fe-2S] I - rice              144  3.7e-14   1
sp|P00244|FER1_APHFL FERREDOXIN I. >pir|A00248|FEFZ1 ferr...   143  5.2e-14   1
sp|P10770|FER_PERBI  FERREDOXIN. >pir|A30036|FEDQ ferredo...   124  5.3e-14   2
sp|P07838|FER_BRYMA  FERREDOXIN. >pir|S07452|FEYO ferredo...   142  7.5e-14   1
sp|P00233|FER_GLEJA  FERREDOXIN. >pir|A00239|FEFNG ferred...   142  7.7e-14   1
sp|P00225|FER_LEUGL  FERREDOXIN. >pir|A92055|FELG ferredo...   141  1.1e-13   1
sp|P00229|FER1_PHYAM FERREDOXIN I. >pir|A00236|FEFW1 ferr...   140  1.6e-13   1
pir|B00238|FEFWF     ferredoxin [2Fe-2S] I - food pokeweed     140  1.6e-13   1
sp|P27788|FER3_MAIZE FERREDOXIN III PRECURSOR. >gp|M73831...   129  1.9e-13   2
sp|P00226|FER_SAMNI  FERREDOXIN. >pir|A00233|FEED ferredo...   139  2.2e-13   1
sp|P00223|FER_ARCLA  FERREDOXIN. >pir|A00230|FEBQ ferredo...   115  2.3e-13   2
pir|C47673|C47673    ferredoxin [2Fe-2S] - Synechococcus ...   138  3.2e-13   1
sp|P31965|FER1_SYNP2 FERREDOXIN I.                             138  3.2e-13   1
sp|P14938|FER3_RAPSA FERREDOXIN, LEAF L-A.                     138  3.2e-13   1
pir|JX0082|JX0082    ferredoxin [2Fe-2S] A, leaf - radish      138  3.2e-13   1
sp|P00228|FER_WHEAT  FERREDOXIN PRECURSOR. >pir|S37226|FE...   146  3.2e-13   1
sp|P00221|FER1_SPIOL FERREDOXIN I PRECURSOR. >pir|S00437|...   146  3.6e-13   1
gp|D30763|RICFERR_1  ferredoxin [Oryza sativa]                 144  6.4e-13   1
pir|S40169|S40169    FdxH protein - Plectonema boryanum >...   135  9.1e-13   1
sp|P46035|FER2_PLEBO FERREDOXIN II (FDII).                     135  9.2e-13   1
sp|P13106|FER_BUMFI  FERREDOXIN. >pir|A28857|FEBF2 ferred...   135  9.2e-13   1
pir|A61291|A61291    ferredoxin [2Fe-2S] - parsley             133  1.9e-12   1
sp|P00227|FER_BRANA  FERREDOXIN. >pir|A00234|FERP ferredo...   132  2.7e-12   1
sp|P16972|FER_ARATH  FERREDOXIN PRECURSOR. >pir|S09979|S0...   140  3.8e-12   1
pir|S20934|S20934    ferredoxin [2Fe-2S] - Calothrix sp. ...   129  7.4e-12   1
pir|S49996|S49996    2Fe-2S-ferredoxin - Anabaena variabi...   129  7.4e-12   1
sp|P28610|FERH_FREDI FERREDOXIN, HETEROCYST.                   129  7.5e-12   1
sp|P46047|FERV_ANAVA FERREDOXIN, VEGETATIVE.                   129  7.5e-12   1
pir|JA0099|JA0099    ferredoxin [2Fe-2S] - Ochromonas danica   129  7.5e-12   1
pdb|1FXI|A           Ferredoxin I >pdb|1FXI|B Ferredoxin ...   129  7.6e-12   1
sp|P04669|FER_SILPR  FERREDOXIN PRECURSOR. >pir|A23011|FE...   138  7.9e-12   1
sp|P00220|FER_MEDSA  FERREDOXIN. >pir|A00227|FEAA ferredo...   126  2.1e-11   1
sp|P27787|FER1_MAIZE FERREDOXIN I PRECURSOR. >gp|M73829|M...   131  1.1e-10   1
sp|P00251|FER2_APHSA FERREDOXIN II. >pir|A00255|FEAH2 fer...   121  4.1e-10   1
sp|P19734|DMPP_PSEPU PHENOL HYDROXYLASE P5 PROTEIN (EC 1....   104  2.6e-06   1
pir|F37831|F37831    phenol 2-monooxygenase (EC 1.14.13.7...   104  2.6e-06   1
gp|D28864|PSEPHEAA_6 one component of phenol hydroxylase ...   104  2.6e-06   1
pir|S47419|S47419    phenolhydroxylase chain - Pseudomona...   102  5.2e-06   1
pir|S44308|S44308    phenol hydroxylase - Pseudomonas put...   100  1.0e-05   1
sp|P00237|FER2_EQUAR FERREDOXIN II. >pir|B04609|FEEQ2F fe...    92  4.1e-05   1
sp|P00236|FER2_EQUTE FERREDOXIN II. >pir|A00241|FEEQ2 fer...    92  4.1e-05   1
sp|P08451|FER2_SYNP6 FERREDOXIN II.                             92  5.6e-05   1
pir|S10833|FEYC2     ferredoxin [2Fe-2S] II - Synechococc...    92  5.7e-05   1
gp|D31732|PEENIRA_1  nitrite reductase [Plectonema boryanum]    93  0.00012   1


WARNING:  Descriptions of 42 database sequences were not reported due to the
          limiting value of parameter V = 100.



>pir|S35235|S35235 ferredoxin [2Fe-2S] - Halobacterium salinarium
            >gp|X68103|HSFDXG_1 ferredoxin [Halobacterium salinarium]
            Length = 129

 Score = 681 (310.4 bits), Expect = 3.9e-89, P = 3.9e-89
 Identities = 128/128 (100%), Positives = 128/128 (100%)

Query:     1 PTVEYLNYETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWP 60
             PTVEYLNYETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWP
Sbjct:     2 PTVEYLNYETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWP 61

Query:    61 FSCRAGACANCASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAADEVKIVYNAKHL 120
             FSCRAGACANCASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAADEVKIVYNAKHL
Sbjct:    62 FSCRAGACANCASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAADEVKIVYNAKHL 121

Query:   121 DYLQNRVI 128
             DYLQNRVI
Sbjct:   122 DYLQNRVI 129


>sp|P00216|FER_HALHA FERREDOXIN. >pir|A00220|FEHS ferredoxin [2Fe-2S] -
            Halobacterium halobium
            Length = 128

 Score = 681 (310.4 bits), Expect = 3.9e-89, P = 3.9e-89
 Identities = 128/128 (100%), Positives = 128/128 (100%)

Query:     1 PTVEYLNYETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWP 60
             PTVEYLNYETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWP
Sbjct:     1 PTVEYLNYETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWP 60

Query:    61 FSCRAGACANCASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAADEVKIVYNAKHL 120
             FSCRAGACANCASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAADEVKIVYNAKHL
Sbjct:    61 FSCRAGACANCASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAADEVKIVYNAKHL 120

Query:   121 DYLQNRVI 128
             DYLQNRVI
Sbjct:   121 DYLQNRVI 128


>sp|P00217|FER_HALSP FERREDOXIN. >pir|A00221|FEHSX ferredoxin [2Fe-2S] -
            Halobacterium sp.
            Length = 128

 Score = 583 (265.7 bits), Expect = 1.1e-75, P = 1.1e-75
 Identities = 108/128 (84%), Positives = 118/128 (92%)

Query:     1 PTVEYLNYETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWP 60
             PTVEYLNYE +DD GWDM DDD+F +A+D  LD EDYG++EV EGEYILEAAEAQGYDWP
Sbjct:     1 PTVEYLNYEVVDDNGWDMYDDDVFGEASDMDLDDEDYGSLEVNEGEYILEAAEAQGYDWP 60

Query:    61 FSCRAGACANCASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAADEVKIVYNAKHL 120
             FSCRAGACANCA+IV EG+IDMDMQQILSDEEVE+K+VRLTCIGSP ADEVKIVYNAKHL
Sbjct:    61 FSCRAGACANCAAIVLEGDIDMDMQQILSDEEVEDKNVRLTCIGSPDADEVKIVYNAKHL 120

Query:   121 DYLQNRVI 128
             DYLQNRVI
Sbjct:   121 DYLQNRVI 128


>sp|P15788|FER_SYNP4 FERREDOXIN. >pir|A28858|A28858 ferredoxin [2Fe-2S] -
            Synechococcus sp.
            Length = 98

 Score = 176 (80.2 bits), Expect = 1.7e-25, Sum P(2) = 1.7e-25
 Identities = 31/56 (55%), Positives = 41/56 (73%)

Query:    39 TMEVAEGEYILEAAEAQGYDWPFSCRAGACANCASIVKEGEIDMDMQQILSDEEVE 94
             T+EV + EYIL+ AE +G D P+SCRAGAC+ CA  +KEGEID   Q  L D+++E
Sbjct:    17 TIEVPDDEYILDVAEEEGIDLPYSCRAGACSTCAGKIKEGEIDQSDQSFLDDDQIE 72

 Score = 45 (20.5 bits), Expect = 1.7e-25, Sum P(2) = 1.7e-25
 Identities = 11/35 (31%), Positives = 17/35 (48%)

Query:    86 QILSDEEVEEKDVRLTCIGSPAADEVKIVYNAKHL 120
             Q   D++  E    LTC+  PA+D   I +  + L
Sbjct:    63 QSFLDDDQIEAGYVLTCVAYPASDCTIITHQEEEL 97


>sp|P00241|FER_GALSU FERREDOXIN. >pir|A00245|FEKK ferredoxin [2Fe-2S] - red
            alga (Cyanidium caldarium)
            Length = 98

 Score = 162 (73.8 bits), Expect = 5.7e-22, Sum P(2) = 5.7e-22
 Identities = 29/56 (51%), Positives = 40/56 (71%)

Query:    39 TMEVAEGEYILEAAEAQGYDWPFSCRAGACANCASIVKEGEIDMDMQQILSDEEVE 94
             T+E  + +YIL+AAE QG D P+SCRAGAC+ CA  + EGE+D   Q  L D++V+
Sbjct:    17 TIECPDDQYILDAAEEQGLDLPYSCRAGACSTCAGKLLEGEVDQSDQSFLDDDQVK 72

 Score = 33 (15.0 bits), Expect = 5.7e-22, Sum P(2) = 5.7e-22
 Identities = 7/35 (20%), Positives = 16/35 (45%)

Query:    86 QILSDEEVEEKDVRLTCIGSPAADEVKIVYNAKHL 120
             Q   D++  +    LTC+  P ++   + +  + L
Sbjct:    63 QSFLDDDQVKAGFVLTCVAYPTSNATILTHQEESL 97



.........................................................
.................Material Deleted........................
.........................................................




>gp|D64000|SYCSLRB_86 ferredoxin [Synechocystis sp.]
            Length = 97

 Score = 146 (66.6 bits), Expect = 1.7e-14, P = 1.7e-14
 Identities = 25/56 (44%), Positives = 37/56 (66%)

Query:    39 TMEVAEGEYILEAAEAQGYDWPFSCRAGACANCASIVKEGEIDMDMQQILSDEEVE 94
             ++E ++  YIL+AAE  G D P+SCRAGAC+ CA  +  G +D   Q  L D+++E
Sbjct:    16 SIECSDDTYILDAAEEAGLDLPYSCRAGACSTCAGKITAGSVDQSDQSFLDDDQIE 71


>sp|P27320|FER_SYNY3 FERREDOXIN.
            Length = 96

 Score = 146 (66.6 bits), Expect = 1.7e-14, P = 1.7e-14
 Identities = 25/56 (44%), Positives = 37/56 (66%)

Query:    39 TMEVAEGEYILEAAEAQGYDWPFSCRAGACANCASIVKEGEIDMDMQQILSDEEVE 94
             ++E ++  YIL+AAE  G D P+SCRAGAC+ CA  +  G +D   Q  L D+++E
Sbjct:    15 SIECSDDTYILDAAEEAGLDLPYSCRAGACSTCAGKITAGSVDQSDQSFLDDDQIE 70


WARNING:  HSPs involving 92 database sequences were not reported due to the
          limiting value of parameter B = 50.

Parameters:
  V=100
  B=50
  H=1
  -qtype

  -ctxfactor=1.00
  E=10

  Query                        -----  As Used  -----    -----  Computed  ----
  Frame  MatID Matrix name     Lambda    K       H      Lambda    K       H
   +0      0   BLOSUM62        0.316   0.136   0.401    same    same    same

  Query
  Frame  MatID  Length  Eff.Length   E    S W   T  X     E2  S2
   +0      0      128       128      10. 59 3  11 22    0.22 31

Statistics:
  Query          Expected         Observed           HSPs       HSPs
  Frame  MatID  High Score       High Score       Reportable  Reported
   +0      0    63 (28.7 bits)  681 (310.4 bits)      205         84

  Query         Neighborhd  Word      Excluded    Failed   Successful  Overlaps
  Frame  MatID   Words      Hits        Hits    Extensions Extensions  Excluded
   +0      0      3526    13602898     2826310    10729133    47445       547

  Database:  Non-redundant PDB+SwissProt+SPupdate+PIR+GenPept+GPupdate
    Release date:  6:03 AM EST Feb 3, 1996
    Posted date:  6:04 AM EST Feb 3, 1996
  # of letters in database:  51,502,515
  # of sequences in database:  173,745
  # of database sequences satisfying E:  142
  No. of states in DFA:  531 (52 KB)
  Total size of DFA:  91 KB (128 KB)
  Time to generate neighborhood:  0.02u 0.01s 0.03t  Real: 00:00:00
  No. of processors used:  8
  Time to search database:  62.74u 1.36s 64.10t  Real: 00:00:13
  Total cpu time:  62.83u 1.47s 64.30t  Real: 00:00:13

WARNINGS ISSUED:  2




BLAST format

Again the BLAST output begins with a histogram, though in this case it is rather superfluous. Following the FASTA format, the program then reports a short listing of the highest scores and then a more detailed listing of the matches. The latter includes the sequence name (along with an accession number and the source database), then the raw score (along with an estimate of the amount of information in the match, in bits), the expected number of equal or better matches, a "Sum" probability, the number of identical matches and the number of positively scoring matches. The "Sum" probability is calculated when there are two or more blocks of high scoring identity per query sequence (remember that BLAST does not permit gaps in its results). The "Sum" P(n)-value is the probability that at least n such blocks would be found by chance within the query sequence and that each block would have a score at least as good as the poorest score among them. (This is an approximate statistic since incompatible blocks may be counted as independent.)
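The numbers in the output above can be checked by hand. The sketch below is an observation about this particular output rather than a statement of how every BLAST version defines its scores: the bit values shown appear consistent with lambda * S / ln 2, using the Lambda of 0.316 from the Parameters block. It also assumes the usual Poisson approximation, under which an expected count E corresponds to a probability 1 - exp(-E) of seeing at least one such match by chance; this is why Expect and P look identical for the very small values reported here. The function names are hypothetical.

```python
import math

LAMBDA = 0.316  # "Lambda" as printed in the Parameters block of the output


def raw_to_bits(raw_score, lam=LAMBDA):
    """Raw score to bits, assuming bits = lambda * S / ln 2."""
    return lam * raw_score / math.log(2)


def expect_to_p(expect):
    """Chance of at least one match, assuming a Poisson number of matches."""
    return -math.expm1(-expect)  # 1 - exp(-E), accurate for tiny E


if __name__ == "__main__":
    # (raw score, bits as printed in the output above)
    for raw, reported in [(681, 310.4), (583, 265.7), (176, 80.2), (45, 20.5)]:
        print(raw, round(raw_to_bits(raw), 2), "reported:", reported)
    print(expect_to_p(3.9e-89), expect_to_p(10))
```

The computed values agree with the printed ones to within about 0.1 bit, the residue presumably coming from the rounding of Lambda in the printout.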

The BLAST algorithm is capable of speeding through the entire amino acid database within 13 seconds for this query. Quite an improvement over FASTA but then it is not doing as much. BLAST is probably not as sensitive for non-coding nucleotide sequences due to the high probability of small insertions / deletions (indels) that will occur in such data.


BLITZ

The BLITZ server uses the Smith-Waterman local similarity algorithm (see the section on alignments) to compare the query sequence against the Swiss-Prot database (they plan to make more databases available in the near future). The advantage of this implementation, termed MPsrch, is mainly that it is running on a MasPar MP-1 computer (a "massively" parallel computer) with 4096 processors. Because of the use of a parallel computer, "MPsrch is the fastest implementation of the SW algorithm currently available on any machine". The implementation is due to S.S.Sturrock and J.F.Collins (1993, MPsrch version 1.3, Biocomputing Research Unit, University of Edinburgh, UK). Remember to quote the authors of any search algorithm you use.
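As a reminder of what the Smith-Waterman recurrence computes, here is a minimal serial sketch in Python. It is not MPsrch: the match, mismatch and gap values are assumptions chosen for illustration (MPsrch scores with a PAM matrix and an indel penalty), the gap penalty is a simple per-residue cost, and only the best local score is returned rather than the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the score matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment here
                prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                prev[j] + gap,       # a[i-1] opposite a gap
                curr[j - 1] + gap,   # b[j-1] opposite a gap
            )
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("ACACACTA", "AGCACACA"))
```

Every cell of the matrix is filled in, so the work grows with the product of the two sequence lengths. This is why a search of a whole database benefits so much from a parallel machine: cells on the same anti-diagonal do not depend on one another and can be computed at the same time.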

The input format for a BLITZ search is simply

TITLE HALHA FER
SEQ
PTVEYLNYETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWP
FSCRAGACANCASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAADEVKIVYNAKHL
DYLQNRVI

and only SEQ is mandatory; even the TITLE is optional. The other options are PAM n (sets the PAM n scoring matrix for comparison of proteins [120]), INDEL n (sets the penalty for indels and gaps [default is dependent on the PAM matrix chosen, 13 for PAM120]), ALIGN n (number of best alignments presented [30]), NAMES n (number of scores to report [50]). Mail the file to Blitz@ebi.ac.uk and the results are mailed back to you.


BLITZ output

The output generated by this mail message is ...



Search started: Sat Feb  3 21:12:49 1996

MPsrch:         Version 1.5 - Shane S. Sturrock & John F. Collins 1993.
                Biocomputing Research Unit, University of Edinburgh, UK.

Execution:      MasPar time 34.07 Seconds at EMBL, Heidelberg, Germany
                65.325 Million cell updates/sec

Title:          HALHA
Description:    FER

Sequence:       1 PTVEYLNYETLDDQGWDMDD..........DEVKIVYNAKHLDYLQNRVI 128

Parameters:     swissprot (49340 seqs, 17385503 residues)
                PAM 120;  Penalty 13;  Perfect Score 1009;  Align 30

Predicted No. is the number of results expected by chance to have a score
greater than or equal to the score of the result being printed, and is
derived by analysis of the total score distribution which gave:

Statistics:     Mean 41.196;  Variance 64.575;  scale 0.638

 No.   Score  %Match  Length  ID          Description                 Pred. No.
--------------------------------------------------------------------------------
   1    1009   100.0     128  FER_HALHA   FERREDOXIN.                 0.00e+00
   2     877    86.9     128  FER_HALSP   FERREDOXIN.                 0.00e+00
   3     291    28.8      98  FER_SYNP4   FERREDOXIN.                 0.00e+00
   4     281    27.8      98  FER1_ANAVA  FERREDOXIN I.               0.00e+00
   5     279    27.7      98  FER1_ANASP  FERREDOXIN I.               0.00e+00
   6     277    27.5      97  FER_SYNSP   FERREDOXIN.                 0.00e+00
   7     275    27.3      98  FER_NOSMU   FERREDOXIN.                 6.91e-41
   8     274    27.2      96  FER_SYNLI   FERREDOXIN.                 6.91e-41
   9     271    26.9      98  FER2_NOSMU  FERREDOXIN II.              4.84e-40
  10     261    25.9      98  FER1_NOSMU  FERREDOXIN I.               1.06e-37
  11     257    25.5      95  FER_MARPO   FERREDOXIN.                 9.22e-37
  12     256    25.4      96  FER_EUGVI   FERREDOXIN.                 1.58e-36
  13     256    25.4      98  FER_CHLFR   FERREDOXIN.                 1.58e-36
  14     252    25.0      98  FER1_SYNP7  FERREDOXIN I.               1.37e-35
  15     251    24.9      98  FER_GALSU   FERREDOXIN.                 2.34e-35
  16     249    24.7      98  FER_MASLA   FERREDOXIN.                 6.86e-35
  17     248    24.6      98  FER1_CYAPA  FERREDOXIN I.               1.17e-34
  18     243    24.1      97  FER_RHOPL   FERREDOXIN.                 1.71e-33
  19     242    24.0      98  FER_SPIPL   FERREDOXIN.                 2.92e-33
  20     242    24.0      97  FER2_SPIOL  FERREDOXIN II.              2.92e-33
  21     242    24.0      95  FER1_EQUTE  FERREDOXIN I.               2.92e-33
  22     240    23.8      98  FER_PORUM   FERREDOXIN.                 8.49e-33
  23     239    23.7      96  FER_SYNY3   FERREDOXIN.                 1.45e-32
  24     239    23.7      98  FER_SPIMA   FERREDOXIN.                 1.45e-32
  25     239    23.7      95  FER1_EQUAR  FERREDOXIN I.               1.45e-32
  26     238    23.6      96  FER_SYNY4   FERREDOXIN.                 2.47e-32
  27     236    23.4      95  FER2_DUNSA  FERREDOXIN II.              7.15e-32
  28     234    23.2      97  FER1_APHFL  FERREDOXIN I.               2.07e-31
  29     233    23.1      96  FER1_PHYES  FERREDOXIN I.               3.52e-31
  30     232    23.0      96  FER1_SYNP2  FERREDOXIN I.               5.97e-31
  31     232    23.0      95  FER_GLEJA   FERREDOXIN.                 5.97e-31
  32     232    23.0      98  FER_BRYMA   FERREDOXIN.                 5.97e-31
  33     230    22.8      96  FER1_PHYAM  FERREDOXIN I.               1.72e-30
  34     229    22.7     147  FER1_SPIOL  FERREDOXIN I PRECURSOR.     2.92e-30
  35     228    22.6      96  FER_SCEQU   FERREDOXIN.                 4.95e-30
  36     227    22.5      97  FER_CYACA   FERREDOXIN.                 8.40e-30
  37     226    22.4      98  FER2_RAPSA  FERREDOXIN, ROOT R-B2.      1.42e-29
  38     225    22.3      96  FER1_ORYSA  FERREDOXIN I.               2.41e-29
  39     225    22.3      98  FER1_RAPSA  FERREDOXIN, ROOT R-B1.      2.41e-29
  40     224    22.2     143  FER_WHEAT   FERREDOXIN PRECURSOR.       4.08e-29
  41     223    22.1      98  FER_BUMFI   FERREDOXIN.                 6.90e-29
  42     223    22.1     135  FER5_MAIZE  FERREDOXIN V PRECURSOR.     6.90e-29
  43     221    21.9      95  FER1_DUNSA  FERREDOXIN I.               1.97e-28
  44     218    21.6     126  FER_CHLRE   FERREDOXIN PRECURSOR.       9.49e-28
  45     218    21.6     148  FER_ARATH   FERREDOXIN PRECURSOR.       9.49e-28
  46     218    21.6     146  FER_SILPR   FERREDOXIN PRECURSOR.       9.49e-28
  47     217    21.5     152  FER3_MAIZE  FERREDOXIN III PRECURSOR.   1.60e-27
  48     216    21.4      96  FER_LEUGL   FERREDOXIN.                 2.70e-27
  49     215    21.3      96  FER_APHSA   FERREDOXIN I.               4.55e-27
  50     215    21.3      96  FER3_RAPSA  FERREDOXIN, LEAF L-A.       4.55e-27


RESULT    1     Score 1009;  Match 0.0%;  Predicted No. 0.00e+00;

ID   FER_HALHA      STANDARD;      PRT;   128 AA.
DE   FERREDOXIN.

          Matches 128;  Mismatches 0;  Partials 0;  Indels 0;  Gaps 0;

          ************************************************************
Db      1 PTVEYLNYETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWP 60
Qy      1 PTVEYLNYETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWP 60

          ************************************************************
Db     61 FSCRAGACANCASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAADEVKIVYNAKHL 120
Qy     61 FSCRAGACANCASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAADEVKIVYNAKHL 120

          ********
Db    121 DYLQNRVI 128
Qy    121 DYLQNRVI 128


RESULT    2     Score 877;  Match 0.0%;  Predicted No. 0.00e+00;

ID   FER_HALSP      STANDARD;      PRT;   128 AA.
DE   FERREDOXIN.

          Matches 108;  Mismatches 11;  Partials 9;  Indels 0;  Gaps 0;

          ********* .** **** ***.*  *.*  ** ****..** *****************
Db      1 PTVEYLNYEVVDDNGWDMYDDDVFGEASDMDLDDEDYGSLEVNEGEYILEAAEAQGYDWP 60
Qy      1 PTVEYLNYETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWP 60

          ************.** **.***************.*.********* *************
Db     61 FSCRAGACANCAAIVLEGDIDMDMQQILSDEEVEDKNVRLTCIGSPDADEVKIVYNAKHL 120
Qy     61 FSCRAGACANCASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAADEVKIVYNAKHL 120

          ********
Db    121 DYLQNRVI 128
Qy    121 DYLQNRVI 128


RESULT    3     Score 291;  Match 0.0%;  Predicted No. 0.00e+00;

ID   FER_SYNP4      STANDARD;      PRT;    98 AA.
DE   FERREDOXIN.

          Matches 38;  Mismatches 17;  Partials 15;  Indels 1;  Gaps 1;

          *.**.. ****. ** .* * *.*******. **. .******   *  * *...*   *
Db     17 TIEVPDDEYILDVAEEEGIDLPYSCRAGACSTCAGKIKEGEIDQSDQSFLDDDQIEAGYV 76
Qy     39 TMEVAEGEYILEAAEAQGYDWPFSCRAGACANCASIVKEGEIDMDMQQILSDEEVEEKDV 98

           ***.. **.*
Db     77 -LTCVAYPASD 86
Qy     99 RLTCIGSPAAD 109


RESULT    4     Score 281;  Match 0.0%;  Predicted No. 0.00e+00;

ID   FER1_ANAVA     STANDARD;      PRT;    98 AA.
DE   FERREDOXIN I.

          Matches 38;  Mismatches 19;  Partials 16;  Indels 2;  Gaps 2;

          *..*.. ****.*** **** *********. **. .  * .*   *  * *...*   *
Db     17 TIDVPDDEYILDAAEEQGYDLPFSCRAGACSTCAGKLVSGTVDQSDQSFLDDDQIEAGYV 76
Qy     39 TMEVAEGEYILEAAEAQGYDWPFSCRAGACANCASIVKEGEIDMDMQQILSDEEVEEKDV 98

           ***.. *..* * *
Db     77 -LTCVAYPTSD-VTI 89
Qy     99 RLTCIGSPAADEVKI 113


RESULT    5     Score 279;  Match 0.0%;  Predicted No. 0.00e+00;

ID   FER1_ANASP     STANDARD;      PRT;    98 AA.
DE   FERREDOXIN I.

          Matches 37;  Mismatches 19;  Partials 15;  Indels 1;  Gaps 1;

          .**.. ****.*** **** *********. **. .  * .*   *  * *...*   * 
Db     18 IEVPDDEYILDAAEEQGYDLPFSCRAGACSTCAGKLVSGTVDQSDQSFLDDDQIEAGYV- 76
Qy     40 MEVAEGEYILEAAEAQGYDWPFSCRAGACANCASIVKEGEIDMDMQQILSDEEVEEKDVR 99

          ***.. *..* *
Db     77 LTCVAYPTSDVV 88
Qy    100 LTCIGSPAADEV 111


RESULT    6     Score 277;  Match 0.0%;  Predicted No. 0.00e+00;

ID   FER_SYNSP      STANDARD;      PRT;    97 AA.
DE   FERREDOXIN.

          Matches 39;  Mismatches 21;  Partials 15;  Indels 2;  Gaps 2;

          ** .  *..*.* ****. ** ** * *********. **. . ***.*   *  * *..
Db     11 DGSE-TTIDVPEDEYILDVAEEQGLDLPFSCRAGACSTCAGKLLEGEVDQSDQSFLDDDQ 69
Qy     33 DGEDYGTMEVAEGEYILEAAEAQGYDWPFSCRAGACANCASIVKEGEIDMDMQQILSDEE 92

          . **   ***.. * .*
Db     70 I-EKGFVLTCVAYPRSD 85
Qy     93 VEEKDVRLTCIGSPAAD 109


RESULT    7     Score 275;  Match 0.0%;  Predicted No. 6.91e-41;

ID   FER_NOSMU      STANDARD;      PRT;    98 AA.
DE   FERREDOXIN.

          Matches 36;  Mismatches 19;  Partials 16;  Indels 1;  Gaps 1;

          .**.. ****.*** .*** *********. **. .  * .*   *  * *...*   * 
Db     18 IEVPDDEYILDAAEEEGYDLPFSCRAGACSTCAGKLVSGTVDQSDQSFLDDDQIEAGYV- 76
Qy     40 MEVAEGEYILEAAEAQGYDWPFSCRAGACANCASIVKEGEIDMDMQQILSDEEVEEKDVR 99

          ***.. *..* *
Db     77 LTCVAYPTSDVV 88
Qy    100 LTCIGSPAADEV 111


..................................................
...............Material Deleted ..................
..................................................


RESULT   29     Score 233;  Match 0.0%;  Predicted No. 3.52e-31;

ID   FER1_PHYES     STANDARD;      PRT;    96 AA.
DE   FERREDOXIN I.

          Matches 32;  Mismatches 22;  Partials 19;  Indels 2;  Gaps 2;

          *.. ..  *.*.***  * * *.*****.*..**. *  * .* . *  * *...*   *
Db     15 TIDCPDDTYVLDAAEEAGLDLPYSCRAGSCSSCAGKVTAGTVDQEDQSFLDDDQIEAGFV 74
Qy     39 TMEVAEGEYILEAAEAQGYDWPFSCRAGACANCASIVKEGEIDMDMQQILSDEEVEEKDV 98

           ***.. * .* * *
Db     75 -LTCVAYPKGD-VTI 87
Qy     99 RLTCIGSPAADEVKI 113


RESULT   30     Score 232;  Match 0.0%;  Predicted No. 5.97e-31;

ID   FER1_SYNP2     STANDARD;      PRT;    96 AA.
DE   FERREDOXIN I.

          Matches 33;  Mismatches 24;  Partials 14;  Indels 2;  Gaps 2;

          . .. ****..*   *** * *******. **. .  * .*   *  * *...*   * *
Db     17 DAPDDEYILDSAGDAGYDLPASCRAGACSTCAGKIVSGTVDQSEQSFLDDDQIEAGYV-L 75
Qy     41 EVAEGEYILEAAEAQGYDWPFSCRAGACANCASIVKEGEIDMDMQQILSDEEVEEKDVRL 100

          ***. * .* * *
Db     76 TCIAYPQSD-VTI 87
Qy    101 TCIGSPAADEVKI 113

Search completed: Sat Feb  3 21:13:27 1996



BLITZ format

This particular search took only 34 seconds, again a great improvement over the FASTA approach. While not as fast as BLAST, it should (so they say) give a more sensitive search for distant homologies. The mean and variance of the distribution of scores from the entire database are calculated. These are used to construct empirical statistics of the predicted number of random matches in the database equal to or better than that found. The algorithm then lists the best scores (50 of them here, the default for NAMES) and then lists more detailed reports for a subset of these (30 here, the default for ALIGN). For each it reports the raw score, the percent match, the predicted number expected, the number of matches, the number of mismatches, the number of partial matches (residue pairs with a positive score in the PAM matrix), the number of indels and the number of gaps. This program distinguishes indels from gaps in that a single gap can be composed of any number of adjacent indels.
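The difference between indels and gaps is easy to see in code. The sketch below is an illustration in Python with hypothetical function names, not part of MPsrch: it counts every "-" character as one indel and every unbroken run of "-" characters as one gap. It also checks an observation about the summary table, namely that the %Match column appears to equal the score divided by the perfect score (1009 here); that reading is inferred from the numbers, not taken from the program's documentation.

```python
def count_indels_and_gaps(db_line, qy_line):
    """Count indel characters and gap runs in a pair of aligned lines."""
    indels = gaps = 0
    for line in (db_line, qy_line):
        in_gap = False
        for ch in line:
            if ch == "-":
                indels += 1
                if not in_gap:  # first "-" of a run opens a new gap
                    gaps += 1
                in_gap = True
            else:
                in_gap = False
    return indels, gaps


def percent_match(score, perfect_score):
    """Score as a percentage of the perfect score, to one decimal place."""
    return round(100.0 * score / perfect_score, 1)


if __name__ == "__main__":
    # Second block of RESULT 4 above (the first block contains no "-").
    print(count_indels_and_gaps("-LTCVAYPTSD-VTI", "RLTCIGSPAADEVKI"))
    # A made-up pair: three adjacent indels forming a single gap.
    print(count_indels_and_gaps("AC---GT", "ACTTTGT"))
    print(percent_match(877, 1009), percent_match(291, 1009))
```

For RESULT 4 the two counts coincide because each gap is a single residue long; they differ only when indels are adjacent, as in the made-up pair.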

In this case all 50 hits have very small expected numbers, indicating that they each have statistically significant homology to the ferredoxin query sequence (not too surprising since they are all different ferredoxins). These statistics are essentially extreme value statistics. Also note that the Smith-Waterman alignment algorithm does a best local alignment (more on this later), so the entire query sequence may not be presented in the output. In this case, since the algorithm does permit gaps in the alignment, only one match per sequence is recorded.


BLAZE

BLAZE came along before BLAST and, to my knowledge, predated BLITZ. It is another implementation for a parallel computer system. This one is operated by Intelligenetics, also on a MasPar MP-1. Although Intelligenetics is no longer in the general ethernet "cyberspace", it still operates this program on a computer upon which time can be bought for real money. I include it for interest's sake but also to note that nucleotide searches as well as protein searches were permitted on this machine (as of July 92). It was claimed that the algorithm could work through Swiss-Prot with a query of 100 amino acids in 15 seconds.


FLASH

FLASH (Fast-Lookup Algorithm for Sequence Homology) uses a different approach and concept. This is an IBM (Thomas J. Watson Research Center at Yorktown Heights) project led by A.Califano and I.Rigoutsos (reference the Proceedings of the First Intl. Conf. on Intelligent Systems for Mol. Biol., July 1993, Bethesda, MD). The algorithm makes use of an object-recognition technique borrowed from computer vision technology. Because it uses a "lookup" algorithm on a preset table of indexed patterns, its speed should not be greatly degraded by increases in the size of the database.

The researchers at IBM claim that this is part of a general class of algorithms that can be used to search very large databases for diverse information including finding molecules of similar shape or structure, text searches and visual object recognition. The major difference in the algorithm is in its "hash" table. Like many other search algorithms it uses a "hash" table for speed, but a conventional hash table of contiguous k-tuples lacks the sensitivity required. Instead, this algorithm constructs a series of non-contiguous k-tuples. By constructing these in a precisely defined manner, the amount of work needed to create the table is increased, but so is the sensitivity, since there are many more k-tuples per sequence and they are not sensitive to the odd mismatch. The searches are implemented on seven NON-dedicated IBM/RS6000 machines (what else were you expecting).
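The idea of indexing non-contiguous k-tuples can be sketched in a few lines. This is a toy illustration only: the choice of k, the window size, the way offsets are enumerated and the diagonal "voting" are all my own assumptions and should not be read as the actual FLASH scheme.

```python
from collections import defaultdict
from itertools import combinations


def spaced_tuples(sequence, k=3, window=5):
    """Yield (start, offsets, residues) for every k-tuple that begins at
    `start` and takes its other k-1 residues from the next window-1 sites."""
    for start in range(len(sequence) - window + 1):
        for rest in combinations(range(1, window), k - 1):
            offsets = (0,) + rest
            residues = "".join(sequence[start + o] for o in offsets)
            yield start, offsets, residues


def build_index(database, k=3, window=5):
    """Precompute a lookup table from (offsets, residues) to database sites."""
    index = defaultdict(list)
    for name, sequence in database.items():
        for start, offsets, residues in spaced_tuples(sequence, k, window):
            index[(offsets, residues)].append((name, start))
    return index


def vote(query, index, k=3, window=5):
    """Count, for each (sequence, diagonal), how many query tuples hit it."""
    votes = defaultdict(int)
    for start, offsets, residues in spaced_tuples(query, k, window):
        for name, db_start in index.get((offsets, residues), ()):
            votes[(name, db_start - start)] += 1
    return votes


database = {"target": "PFSCRAGACAN", "other": "MKLVWWWWWHH"}
index = build_index(database)
# The query differs from "target" at one site (R -> K).
votes = vote("PFSCKAGACAN", index)
best = max(votes, key=votes.get)
print(best, votes[best])
```

Because many tuples skip over the mismatched site, the single substitution removes only some of the hits along the correct diagonal, which is the sense in which such tuples are "not sensitive to the odd mismatch". The price is a larger table: each position contributes several tuples rather than one.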

There are several features of this server that are unusual in comparison to the others. Requests sent to the server must contain a "Subject" line with the word dFLASH or they will be ignored. Most other servers will ignore a "Subject" line or want it to be blank. The server input message must have a BEGIN line, must include a "title" line (beginning with ">") and must have a terminating "1".

The typical input message would be

BLOSUM 62
BEGIN
>HALHA FER  # mandatory title
PTVEYLNYETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWP
FSCRAGACANCASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAADEVKIVYNAKHL
DYLQNRVI
1

A PAM or BLOSUM matrix can be chosen and other options include ALIGNMENTS n (sets the number of alignments to report, up to 10000), THRESHOLD n (score must be greater than n to be reported [30]), PENALTY, VERBOSE, SEQUENCES, KEY XMATCH, SOURCE PROTEIN, TARGET SP. The query sequence must be less than 1500 characters.

This should be mailed to dflash@watson.ibm.com.
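Since the server is strict about its input format, it can help to generate the message body with a small script. The sketch below only builds the text; it does not send any mail, and the "Subject" line containing dFLASH still has to be set in your mail program. The function name is hypothetical, and placing the ALIGNMENTS option directly after the score matrix line is an assumption based on the description of the options.

```python
def build_flash_message(title, sequence, matrix="BLOSUM 62",
                        alignments=None, line_width=60):
    """Return the body of a FLASH e-mail query as a single string."""
    sequence = "".join(sequence.split()).upper()
    if len(sequence) >= 1500:
        raise ValueError("query sequence must be less than 1500 characters")
    lines = [matrix]
    if alignments is not None:
        # Assumed placement: options follow the score matrix line.
        lines.append(f"ALIGNMENTS {alignments}")
    lines.append("BEGIN")
    lines.append(f">{title}")  # mandatory title line
    for i in range(0, len(sequence), line_width):
        lines.append(sequence[i:i + line_width])
    lines.append("1")  # mandatory terminating line
    return "\n".join(lines) + "\n"


print(build_flash_message(
    "HALHA FER  # mandatory title",
    "PTVEYLNYETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWP"
    "FSCRAGACANCASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAADEVKIVYNAKHL"
    "DYLQNRVI",
    alignments=50,
))
```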


FLASH output

The above mail message yielded output as follows two years ago (a test of the system in the last week indicated that the server is still present but it is obviously not functioning in real time - so you have to put up with two year old data).

ELAPSED WALL TIME: 18.0 secs
TOTAL CPU TIME OVER ALL SERVERS: 13.4 secs
Alignments for sequence: HALHA-FER--#-MANDATORY-TITLE
PTVEYLNYETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWPFSCRAGACAN
CASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAADEVKIVYNAKHLDYLQNRVI

Score Matrix: BLOSUM62
Max Reported Alignments: 10000
Score Threshold  At: 30

  Id                 Label    NRes   Score   NrmSc  Match%    Peak NrmPeak
----------------------------------------------------------------------------
   0.            FER_HALHA     128     677     127    100%     125     210
   1.            FER_HALSP     128     579     106     84%     125     209
   2.            FER_SYNSP      87     205      39     48%      90     158
   3.            FER_SYNP4      71     199      37     52%      90     157
   4.            FER_SYNLI      71     196      36     52%      90     158
   5.           FER1_ANAVA      71     189      34     49%      98     179
   6.           FER1_ANASP      72     189      35     50%      98     179
   7.            FER_NOSMU      72     186      34     48%      95     173
   8.           FER1_NOSMU      71     180      33     46%      86     149
   9.            FER_CHLFR      78     178      34     44%      81     151
  10.           FER1_SYNP7      71     178      34     47%      87     163
  11.           FER1_CYAPA      71     178      33     49%      87     163
  12.           FER1_CYACA      71     176      32     46%      87     163
  13.           FER2_NOSMU      77     174      32     42%      80     148
  14.           FER1_RAPSA      94     173      32     41%      81     151
  15.            FER_EUGVI      71     172      32     46%      90     157
  16.            FER_MASLA      86     171      31     44%      81     151
  17.           FER2_SPIOL      70     171      31     48%      84     155
  18.            FER_MARPO      70     168      31     45%      79     141
  19.            FER_SYNY4      77     167      32     42%      81     151
  20.           FER1_EQUTE      72     167      30     48%      93     167
  21.            FER_SPIPL      71     166      30     43%      81     151
  22.           FER2_RAPSA      81     166      31     41%      81     151
  23.            FER_SYNY3      71     165      30     42%      81     151
  24.            FER_SPIMA      71     165      30     43%      81     151
  25.            FER_RHOPL      70     165      30     44%      87     154
  26.           FER1_EQUAR      72     165      30     48%      93     167
  27.            FER_PORUM      69     163      30     46%      80     150
  28.            FER_GLEJA      71     163      30     42%      78     141
  29.            FER_SCEQU      71     162      29     43%      82     151
  30.           FER1_APHFL      71     162      29     45%      81     151
................................................................................
...............................Material deleted.................................
................................................................................
 112.           VG56_HSVI1      26      31       7     34%      26      60
 113.           NUPL_XENLA      23      31       6     26%      30      64
 114.           YJAC_ECOLI      32      30       7     31%      26      60

-----------------------------------------------------------------------------
1. FER_HALHA
              Abs. Alignment: 2934561
                 N. Residues: 128
              Sequence Score: 677
            Normalized Score: 127
              Score/Residues: 5.28906
               Exact Matches: 128
                Exact Match%: 100
        Conservative Matches: 0
       Conservative Matches%: 0
              Total Matches%: 100
                  Mismatches: 0
                  Peak Score: 125
       Normalized Peak Score: 21

     1                   21                  41                  61
     PTVEYLNYETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWPFSCRAGACAN
     PTVEYLNYETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWPFSCRAGACAN
     PTVEYLNYETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWPFSCRAGACAN
     1                   21                  41                  61

     71
     CASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAADEVKIVYNAKHLDYLQNRV
     CASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAADEVKIVYNAKHLDYLQNRV
     CASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAADEVKIVYNAKHLDYLQNRV
     71

-----------------------------------------------------------------------------
2. FER_HALSP
              Abs. Alignment: 2934700
                 N. Residues: 128
              Sequence Score: 579
            Normalized Score: 106.883
              Score/Residues: 4.52344
               Exact Matches: 108
                Exact Match%: 84
        Conservative Matches: 10
       Conservative Matches%: 7
              Total Matches%: 92
                  Mismatches: 10
                  Peak Score: 125
       Normalized Peak Score: 21

     1                   21                  41                  61
     PTVEYLNYETLDDQGWDMDDDDLFEKAADAGLDGEDYGTMEVAEGEYILEAAEAQGYDWPFSCRAGACAN
     PTVEYLNYE +DD GWDM DDD+F +A+D  LD EDYG++EV EGEYILEAAEAQGYDWPFSCRAGACAN
     PTVEYLNYEVVDDNGWDMYDDDVFGEASDMDLDDEDYGSLEVNEGEYILEAAEAQGYDWPFSCRAGACAN
     1                   21                  41                  61

     71
     CASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAADEVKIVYNAKHLDYLQNRV
     CA+IV EG+IDMDMQQILSDEEVE+K+VRLTCIGSP ADEVKIVYNAKHLDYLQNRV
     CAAIVLEGDIDMDMQQILSDEEVEDKNVRLTCIGSPDADEVKIVYNAKHLDYLQNRV
     71

-----------------------------------------------------------------------------
3. FER_SYNSP
              Abs. Alignment: 2937680
                 N. Residues: 87
              Sequence Score: 205
            Normalized Score: 39.7063
              Score/Residues: 2.35632
               Exact Matches: 42
                Exact Match%: 48
        Conservative Matches: 14
       Conservative Matches%: 16
              Total Matches%: 64
                  Mismatches: 31
                  Peak Score: 90
       Normalized Peak Score: 15.8253

     33                  53                  73                  93
     DGEDYGTMEVAEGEYILEAAEAQGYDWPFSCRAGACANCASIVKEGEIDMDMQQILSDEEVEEKDVRLTC
     DG +  T++V E EYIL+ AE QG D PFSCRAGAC+ CA  + EGE+D   Q  L D+++E K   LTC
     DGSE-TTIDVPEDEYILDVAEEQGLDLPFSCRAGACSTCAGKLLEGEVDQSDQSFLDDDQIE-KGFVLTC
     11                  30                  50                  70

     103
     IGSPAADEVKIVYNAKH
     +  P +D  KI+ N +
     VAYPRSD-CKILTNQEE
     79

-----------------------------------------------------------------------------
4. FER_SYNP4
              Abs. Alignment: 2937572
                 N. Residues: 71
              Sequence Score: 199
            Normalized Score: 37.642
              Score/Residues: 2.80282
               Exact Matches: 37
                Exact Match%: 52
        Conservative Matches: 12
       Conservative Matches%: 16
              Total Matches%: 69
                  Mismatches: 22
                  Peak Score: 90
       Normalized Peak Score: 15.75

     39                  59                  79                  99
     TMEVAEGEYILEAAEAQGYDWPFSCRAGACANCASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAA
     T+EV + EYIL+ AE +G D P+SCRAGAC+ CA  +KEGEID   Q  L D+++E     LTC+  PA+
     TIEVPDDEYILDVAEEEGIDLPYSCRAGACSTCAGKIKEGEIDQSDQSFLDDDQIE-AGYVLTCVAYPAS
     17                  37                  57                  76

     109
     D
     D
     D
     86

-----------------------------------------------------------------------------
5. FER_SYNLI
              Abs. Alignment: 2937463
                 N. Residues: 71
              Sequence Score: 196
            Normalized Score: 36.542
              Score/Residues: 2.76056
               Exact Matches: 37
                Exact Match%: 52
        Conservative Matches: 11
       Conservative Matches%: 15
              Total Matches%: 67
                  Mismatches: 23
                  Peak Score: 90
       Normalized Peak Score: 15.8253

     39                  59                  79                  99
     TMEVAEGEYILEAAEAQGYDWPFSCRAGACANCASIVKEGEIDMDMQQILSDEEVEEKDVRLTCIGSPAA
     T++V E EYIL+ AE QG D PFSCRAGAC+ CA  + EGE+D   Q  L D+++E K   LTC+  P +
     TIDVPEDEYILDVAEEQGLDLPFSCRAGACSTCAGKLLEGEVDQSDQSFLDDDQIE-KGFVLTCVAYPRS
     15                  35                  55                  74

     109
     D
     D
     D
     84

.................................................................
................Lots of Material deleted.........................
.................................................................

-----------------------------------------------------------------------------
114. NUPL_XENLA
              Abs. Alignment: 5912326
                 N. Residues: 23
              Sequence Score: 31
            Normalized Score: 6.73889
              Score/Residues: 1.34783
               Exact Matches: 6
                Exact Match%: 26
        Conservative Matches: 8
       Conservative Matches%: 34
              Total Matches%: 60
                  Mismatches: 9
                  Peak Score: 30
       Normalized Peak Score: 6.40556

     81                  101                 32
     DMDMQQILSDEEVEEKDVRLTCI
     + ++ +I++ EE  EK V +  +
     EFNIVEIVTQEEGAEKSVPIATL
     59                  79                  2

-----------------------------------------------------------------------------
115. YJAC_ECOLI
              Abs. Alignment: 10405998
                 N. Residues: 32
              Sequence Score: 30
            Normalized Score: 7.21032
              Score/Residues: 0.9375
               Exact Matches: 10
                Exact Match%: 31
        Conservative Matches: 5
       Conservative Matches%: 15
              Total Matches%: 46
                  Mismatches: 17
                  Peak Score: 26
       Normalized Peak Score: 6.09365

     86                  106                 -2
     QILSDEEVEEKDVRLTCIGSPAADEVKIVYNA
     + LS +      V    + +  AD +KIV NA
     KFLSAKNRTSSHVLYHVMANGDADMLKIVLNA
     362                 382                 277


FLASH format

As can be seen the FLASH output can be quite large. It is best to include the ALIGNMENTS parameter (after the score matrix) to limit output (to something less than the 10000 - the old default, which would blow your computer anyway).

While the output produces a great deal of information on the scores and matches it does not really provide any statistical evaluation of the results. This is a major disadvantage of this server in comparison to the others. It also appears to have difficulty with the sequence numbering beyond the end of the sequence.
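Most of the per-hit numbers in the detailed reports are simple functions of four quantities: the number of residues, the sequence score, the exact matches and the conservative matches. The sketch below recomputes them. The function name is hypothetical, and the use of truncation rather than rounding for the percentages is an inference from the values in the output above, not documented behaviour.

```python
def flash_derived_fields(n_residues, score, exact, conservative):
    """Recompute the derived fields of a FLASH per-hit report."""
    return {
        "score_per_residue": round(score / n_residues, 5),
        # Integer division reproduces the truncated percentages in the output.
        "exact_pct": 100 * exact // n_residues,
        "conservative_pct": 100 * conservative // n_residues,
        "total_pct": 100 * (exact + conservative) // n_residues,
        "mismatches": n_residues - exact - conservative,
    }


# FER_HALSP: 128 residues, score 579, 108 exact and 10 conservative matches.
print(flash_derived_fields(128, 579, 108, 10))
# NUPL_XENLA: 23 residues, score 31, 6 exact and 8 conservative matches.
print(flash_derived_fields(23, 31, 6, 8))
```

None of these quantities says how likely such a match is to arise by chance, which is the statistical evaluation that the other servers supply.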

In terms of speed the 18 seconds reported at the top of the mail message compares favourably with the BLAST and BLITZ servers but doesn't quite live up to the advertising hype. The authors claim that it should best BLAST by two (count 'em - two) orders of magnitude! As a result ... FLASH has not really caught on (perhaps why I have not been able to update the output) and is dominated by BLAST.

Comparing the four different methods is interesting. A few differences are simply due to the fact that different databases are included with BLAST but many of the differences are real preferences. The FASTA algorithm suggests that the top four ferredoxins related to Halobacterium halobium are FER_HALSP, FER_SYNP4, FER_GALSU, and FER_ANAVA. But via BLAST FER_ANAVA is replaced by FER_PORUM and FER_ANAVA only enters at 18th in the list (though some of these may be duplicates in other databases). Via FASTA FER_PORUM is 33rd in the list. Similarly via BLITZ, FER_GALSU is not in the top four but 15th in the list and FER_PORUM is even further down at 22nd. This illustrates that one should not use these algorithms to determine how closely related two sequences are. (In general you should use methods that do global comparisons for such a question.) These are, however, measures of how strongly they are "hit" or "indicated" to be related. The different methods obviously provide quite different answers. FLASH's results are not really comparable since the output is two years old and the databases have changed a great deal in the meantime.


BLOCKS

The BLOCKS server is somewhat related to the other servers mentioned above (and hence included here) but is designed to answer a different question. Instead of looking for similar sequences in the databases, it scans the PROSITE database to search the query sequence for similar protein motifs (the query must be a protein, although the server will optionally translate a nucleotide sequence to protein). Blocks are defined as short ungapped (but potentially with variable length) segments of highly conserved regions of proteins. Currently (Jan. 1996) the BLOCKS database searches on 3179 block patterns. This search is particularly useful for analysing distantly related proteins.

The server searches for any of these blocks throughout the query sequence and reports the results via e-mail. For each block there are known frequencies for each amino acid at each site. Every site in the query is matched against these blocks and scored. The highest scoring blocks are reported.
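A toy version of this kind of search is sketched below. It builds a position-specific scoring matrix from six of the segments of block BL00198 and slides it along the ferredoxin query used in this section. The log-odds scoring with a uniform background and a single pseudocount is an assumption chosen for brevity; the real server uses its own weighting and calibration, so the scores here are not comparable to the ones it reports.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(segments, pseudocount=1.0):
    """Turn aligned, ungapped segments into per-column log-odds scores."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    total = len(segments) + pseudocount
    pssm = []
    for column in range(len(segments[0])):
        counts = Counter(segment[column] for segment in segments)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount * background) / total / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def scan(query, pssm):
    """Score every window of the query; return (score, 1-based start), best first."""
    width = len(pssm)
    hits = []
    for start in range(len(query) - width + 1):
        window = query[start:start + width]
        score = sum(column.get(aa, 0.0) for column, aa in zip(pssm, window))
        hits.append((score, start + 1))
    return sorted(hits, reverse=True)


# Six segments taken from block BL00198 (4Fe-4S ferredoxins).
segments = ["CISCGRCTTGCP", "CVGCQRCEQTCP", "CMTCGVCLEACP",
            "CIFCMACVNVCP", "CIDCGACEPVCP", "CIGCTQCVRACP"]
query = ("GIDPNYRTHKPVVGDSSGHKIYGPVESPKVLGVHGTIVGVDFDLCIADGSCITACPVNVF"
         "QWYETPGHPASEKKADPVNQQACIFCMACVNVCPVAAIDVKPP")
best_score, best_start = scan(query, build_pssm(segments))[0]
print(best_start, query[best_start - 1:best_start + 11])
```

With these inputs the best window starts at residue 83 (CIFCMACVNVCP), the same location as the top-ranked hit in the server output for this query.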

A typical input message is

> Ferredoxin
GIDPNYRTHKPVVGDSSGHKIYGPVESPKVLGVHGTIVGVDFDLCIADGSCITACPVNVF
QWYETPGHPASEKKADPVNQQACIFCMACVNVCPVAAIDVKPP

and there are no options needed for this search (the ">" indicates a sequence title). A nucleotide sequence will be translated in all frames, but a nucleotide sequence containing IUPAC ambiguity codes will be interpreted as a protein and will remain untranslated.
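The behaviour described here can be mimicked with a short sketch: decide whether the input looks like a plain nucleotide sequence and, if so, translate it in every frame. Treating "all frames" as three forward and three reverse-complement frames, and the use of the standard genetic code, are assumptions; the function names are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def looks_like_nucleotide(sequence):
    """True only if every character is an unambiguous base (A, C, G, T or U)."""
    return set(sequence.upper()) <= set("ACGTU")


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna, frame):
    """Translate from offset `frame` (0, 1 or 2), ignoring a trailing partial codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(frame, len(dna) - 2, 3))


def six_frame_translation(sequence):
    """Return the six translations keyed '+1'..'+3' and '-1'..'-3'."""
    if not looks_like_nucleotide(sequence):
        # Ambiguity codes (N, R, Y, ...) land here, as would a protein query.
        raise ValueError("not a plain nucleotide sequence")
    dna = sequence.upper().replace("U", "T")
    frames = {}
    for strand, strand_seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            frames[f"{strand}{frame + 1}"] = translate(strand_seq, frame)
    return frames


print(six_frame_translation("ATGGCCTGA"))
```

Note that a sequence such as ACGTN fails the nucleotide test in this sketch, which is the same trap the server sets: a single ambiguity code is enough to have the whole sequence treated as protein.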

This message should be mailed to blocks@howard.fhcrc.org with a blank subject line. Alternatively you can do the search through their web page. (References should cite S.Henikoff & J.Henikoff, 1991 Nucl.Acids.Res. 19:6565-6572).


BLOCKS output

The BLOCKS output is somewhat complicated. It begins with a lengthy informational message that I have deleted and then continues with the guts of the message.

Query=Ferredoxin , 
 Size=103 Amino Acids
Database=mats.dat, Blocks Searched=3179

1.------------------------------------------------------------------------
Block     Rank Frame Score Strength   Location (aa) Description
BL00198      1   0   1271  1239          83-     94 4Fe-4S ferredoxins, iron
BL00198      2   0   1121  1239          45-     56 4Fe-4S ferredoxins, iron

1271=99.91th percentile of anchor block scores for shuffled queries
P not calculated for single block BL00198 
Maximum number of repeats (from Prosite MAX-REPEAT) = 4
1 non-overlapping repeats in support of BL00198 

BL00198     <->    (7,440):82  
 FER_SULAC 83      CIFCMACVNVCP
                   ||||||||||||
Ferredoxin 83      CIFCMACVNVCP
           45      ciadgscitacp

2.------------------------------------------------------------------------
Block     Rank Frame Score Strength   Location (aa) Description
BL00596A     3   0   1020  1367          70-     89 High potential iron-sulf
BL00596A    60   0    943  1367          73-     92 High potential iron-sulf

1020=13.70th percentile of anchor block scores for shuffled queries
P not calculated for single block BL00596A
                         |---   22 amino acids---|
   BL00596 AAAAAAAAAAAAAAAAAAAAAAA::::::::::::::::::::........BBBBBBBBB
Ferredoxin AAAAAAAAAAAAAAAAAAAAAAA
Ferredoxin <  AAAAAAAAAAAAAAAAAAAAAAA

BL00596A    <->A   (8,29):69           
HPI1_ECTVA 21      ASVDHPSHAAGQKCINCLLY
                   ||         | || |   
Ferredoxin 70      ASeKkAdPVNqQaCIfCMac

3.------------------------------------------------------------------------
Block     Rank Frame Score Strength   Location (aa) Description
BL00590C     4   0   1006  1825          51-    100 LIF / OSM family protein

1006=2.56th percentile of anchor block scores for shuffled queries
P not calculated for single block BL00590C
                         |---   93 amino acids---|
   BL00590 AAAAAAAAAAAAAAA...........BBBBBBBBBBBBB.......CCCCCCCCCCCCC
Ferredoxin                                  :::::::::::::CCCCCCCCCCCCC

BL00590C    <->C   (119,193):50                                      
 LIF_HUMAN 153     CRLCSKYHVGHVDVTYGPDTSGKDVFQKKKLGCQLLGKYKQIIAVLAQAF
                   |       |     | |   | |         |           | |   
Ferredoxin 51      CiTAcPvnVfqwyeTPGhPaSeKkAdpvnqqaCiFcmacvnVcpVaAidv

4.------------------------------------------------------------------------
Block     Rank Frame Score Strength   Location (aa) Description
BL00987B     5   0   1005  2124           1-     45 6-pyruvoyl tetrahydropte
BL00987D   320   0    904  2010          26-     75 6-pyruvoyl tetrahydropte

1005=2.10th percentile of anchor block scores for shuffled queries
P not calculated for single block BL00987B
                         |---   60 amino acids---|
   BL00987 AAAA:::BBBBBBBBBBBBBBBBBBBCCCCCCCCCCCCCDDDDDDDDDDDDDDDDDDDDD
Ferredoxin        BBBBBBBBBBBBBBBBBBB
Ferredoxin                  DDDDDDDDDDDDDDDDDDDDD

BL00987B    <->B   (17,18):0                                    
  PTPS_RAT 18      SFSASHRLHSPSLSAEENLKVFGKCNNPNGHGHNYKVVVTIHGEI
                         | | |        |  |    |   |     |       
Ferredoxin 1       gidpnyRtHkPvvgDssghKiyGpvesPkvlGvhgtiVgvdfdlc

5.------------------------------------------------------------------------
Block     Rank Frame Score Strength   Location (aa) Description
BL00144A     6   0   1001  1327          30-     39 Asparaginase / glutamina

1001=0.44th percentile of anchor block scores for shuffled queries
P not calculated for single block BL00144A
                         |---  103 amino acids---|
   BL00144 AA:::::::::::::::..BBBBB:........................CCCCCCCCCCC
Ferredoxin AA

BL00144A    <->A   (5,57):29 
ASG1_YEAST 58      ILGTGGTIAS
                    ||  |||  
Ferredoxin 30      VLGvhGTIvG

6.------------------------------------------------------------------------
Block     Rank Frame Score Strength   Location (aa) Description
BL00261A     7   0   1001  1521          22-     53 Glycoprotein hormones be
BL00261A   114   0    927  1521          26-     57 Glycoprotein hormones be
BL00261B   213   0    912  1614          41-     83 Glycoprotein hormones be
BL00261B   381   0    900  1614          51-     93 Glycoprotein hormones be
BL00261B   397   0    899  1614          54-     96 Glycoprotein hormones be

1001=0.44th percentile of anchor block scores for shuffled queries
P< 0.053 for BL00261B in support of BL00261A
                         |---   39 amino acids---|
   BL00261 AAAAAAAAAAAAAAAAAAAA::::::......BBBBBBBBBBBBBBBBBBBBBBBBBBB
Ferredoxin AAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBBB
Ferredoxin <  AAAAAAAAAAAAAAAAAAAA
Ferredoxin <            BBBBBBBBBBBBBBBBBBBBBBBBBBB
Ferredoxin <                  BBBBBBBBBBBBBBBBBBBBBBBBBBB

BL00261A    <->A   (2,55):21                       
GTH2_ANGAN 30      CEPINETISVEKDGCPKCLVFQTSICSGHCIT
                     |          |      |   |  | |||
Ferredoxin 22      YGPVEspkvLgVhGtiVgVdFdlcIadGsCIT

BL00261B   A<->B   (9,19):0                                   
TSHB_ANGAN 73      TYQAVEYRTAELPGCPPHVDPRFSYPVALHCTCRACDPARDEC
                             | || |                | ||       
Ferredoxin 54      acpVnVFqWyEtPGHPaSekkadpVnqqacifCmACvnvcpva

7.------------------------------------------------------------------------
Block     Rank Frame Score Strength   Location (aa) Description
BL01039A     8   0   1001  1283          35-     50 Bacterial extracellular 

1001=0.44th percentile of anchor block scores for shuffled queries
P not calculated for single block BL01039A
                         |---   94 amino acids---|
   BL01039 AAAA:::.BBBBBBBBBB::.......................................C
Ferredoxin AAAA

BL01039A    <->A   (18,65):34      
ARTI_ECOLI 41      NQIVGFDVDLAQALCK
                     ||| | ||  |   
Ferredoxin 35      GtIVGvDfDLCiAdgs

7 possible hits reported


In this case, for ferredoxin, the results are not too interesting - it just says that ferredoxin is probably a ferredoxin (the other blocks reported do not have high percentiles in the distribution). This also does not illustrate some of the features that you should know about. Since the example used in the blocks help file is so fascinating, let's look at its output. Its query sequence is based on an ORF from the yeast third chromosome ...

Query=>YCZ2_YEAST  HYPOTHETICAL 40.1 KD PROTEIN IN HMR 3'REGION., 
 Size=368 Amino Acids
Database=/data/blocks_6.0/blocks.dat, Blocks Searched=2302

1.----------------------------------------------------------------------------
Block    Rank Frame Score Strength      Location Description
BL00059A   1     1   1310  2439         2-    42 Zinc-containing alcohol dehyd
BL00059A 371     1    825  2439         0-    40 Zinc-containing alcohol dehyd
BL00059B  15     1    984  1967        52-    77 Zinc-containing alcohol dehyd
BL00059C 105     1    891  2795        77-   134 Zinc-containing alcohol dehyd
BL00059D   2     1   1232  2388       174-   229 Zinc-containing alcohol dehyd

1310=98.5th percentile of anchor block scores for shuffled queries
P<1.4e-06 for BL00059D BL00059B in support of BL00059A
                         |-----  108 residues----|
   BL00059 AAAAAAAAA::.BBBBBB::........CCCCCCCCCCCCC:::...DDDDDDDDDDDDD
>YCZ2_YEAS AAAAAAAAA::BBBBBB::::::::::::::::::::::DDDDDDDDDDDDD
>YCZ2_YEAS A (1,35):1
ADHX_HORSE 9     AAVAWEAGKPVSIEEVEVAPPKAHEVRIKIIATAVCHTDAY
                  ||  | || |  | |         | ||  | |   ||
>YCZ2_YEAS 2     KAVVIEdGKaVVkEgVPiPELeEGfVLIKtLAVAgnpTDwa

BL00059B   A<->B (10,14):9
ADH3_ASPNI 62    PLIGGHEGAGVVVAKGELVKDEDFKI
                   | |   ||  |  |  |   || |
>YCZ2_YEAS 52    GsILGcdAAGqIVKLGPaVdpkDFsI

BL00059D   B<->D (78,122):96
 ADH_CLOBE 173   IGIGAVGLMGIAGAKLRGAGRIIGVGSRPICVEAAKFYGATDILNYKNGHIVDQVM
                  |  |||   |  |        | |          | |||     |     | |
>YCZ2_YEAS 174   gGAtAVGqSLIQlAnKlnGftkIIVvAsrKhEKLlKEYGADqlfDYhDiDvVeQIk

2.----------------------------------------------------------------------------
Block    Rank Frame Score Strength      Location Description
BL00458C   3     1   1077  2417       278-   304 Natriuretic peptides receptor

1077=27.7th percentile of anchor block scores for shuffled queries

                         |-----  320 residues----|
   BL00458 AAA:::::::BBB:::::.CCDDD...............................EEEE
>YCZ2_YEAS <::::::::::::::::::CC

BL00458C    <->C (319,366):277
ANPC_HUMAN 355   NMFVEGFHDAILLYVLALHEVLRAGYS
                 |       |   ||    |||   |
>YCZ2_YEAS 278   NrrqnvtiDrtrLYsiggHEVpfgGiT   


BLOCKS format

The top 400 block scores are retained. Different blocks are compared by dividing the block score by an empirically determined 99.5% calibration score and multiplying by 1000. Hence a score above 1000 is expected for 0.5% of the blocks, and a typical protein should yield about 16 hits (0.5% * 3179 = 15.9) purely by chance. The top ten blocks are retained and if other blocks are associated with these top ten then these will be reported as well.
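The arithmetic behind the normalization is small enough to write out. In this sketch the function names are my own, and the raw score fed in is hypothetical; the calibration value of 723 is the "99.5%=" figure quoted in the BL00198 block entry.

```python
def normalized_block_score(raw_score, calibration_99_5):
    """Scale a raw block score so that the 99.5% calibration score maps to 1000."""
    return 1000.0 * raw_score / calibration_99_5


def expected_chance_hits(n_blocks, tail_fraction=0.005):
    """Number of blocks expected to exceed the calibration score by chance."""
    return n_blocks * tail_fraction


# A raw score equal to the calibration score normalizes to exactly 1000.
print(normalized_block_score(723, 723))
# A hypothetical raw score 20% above the calibration score.
print(normalized_block_score(867.6, 723))
# Chance hits expected over the 3179 blocks searched.
print(round(expected_chance_hits(3179)))
```

This is why a normalized score only a little above 1000 is not convincing by itself: roughly sixteen blocks are expected to reach that level for any query.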

The best hits for the ORF from yeast are for BL00059, the zinc-containing alcohol dehydrogenases. This pattern consists of four blocks of conserved amino acids that have characteristic distances between the blocks. The best block for a family is chosen as the anchor block. Empirical tests of the scores expected from randomized proteins have been carried out by the authors, and the score for BL00059A is very high in comparison to these random choices (98.5th percentile, indicating its significance). Next the output lists a probability estimate of finding blocks B and D given that you have found A. This probability is based on several things, including the observed distances between the blocks and the number of blocks in a family. Next comes a diagrammatic map of where the blocks are located in the sequence. Note that A, B and D are all in the proper locations but the rather poor hit for block C is too close to A and B. Next comes a listing of where the blocks should be located: block A should be 1 to 35 residues from the amino terminus, and it is 1 away; block B should be 10 to 14 residues from block A, and it is 9; block D should be 78 to 122 residues from block B, and it is 96. Finally, the output searches through representatives of Block A and aligns this with the suggested block in the query sequence (in this case it finds ADHX_HORSE for Block A). The second match listed is typical of a random hit.
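The distance information can be pulled out of the report mechanically. The sketch below reads lines of the form "BL00059B   A<->B (10,14):9", taking the bracketed pair as the allowed range and the number after the colon as the observed distance, as described above. The regular expression and function name are my own and were written against the examples in this section only.

```python
import re

DISTANCE_LINE = re.compile(
    r"^(?P<block>BL\d+[A-Z]?)\s+[A-Z]?<->[A-Z]?\s*"
    r"\((?P<min>\d+),(?P<max>\d+)\):(?P<observed>-?\d+)"
)


def check_block_distances(report_lines):
    """Return (block, min, max, observed, within_range) for each distance line."""
    results = []
    for line in report_lines:
        match = DISTANCE_LINE.match(line.strip())
        if match is None:
            continue  # alignment rows, headers and blank lines are skipped
        low, high, observed = (int(match[g]) for g in ("min", "max", "observed"))
        results.append((match["block"], low, high, observed, low <= observed <= high))
    return results


report = [
    "BL00059B   A<->B (10,14):9",
    "ADH3_ASPNI 62    PLIGGHEGAGVVVAKGELVKDEDFKI",
    "BL00059D   B<->D (78,122):96",
    "BL00198     <->    (7,440):82  ",
]
for row in check_block_distances(report):
    print(row)
```

Run on the yeast example, this flags the observed distance of 9 for block B as one short of the listed 10 to 14 range, while block D at 96 sits comfortably inside 78 to 122.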

This simple but very elegant analysis tells you many things about the protein. In this case it has identified an unknown ORF as a "distant member of a large family, apparently one not easily detected using other approaches. The query [sequence was] not reported to be a member of any family either in the original study or in a subsequent more intensive analysis of ORFs from this chromosome".


Getting the Block

In addition to this the BLOCKS server will allow you access to a copy of the PROSITE database to obtain a listing of the actual block found. You can get the entry either via their web page or via e-mail with a message "get BL00198" to blocks@howard.fhcrc.org and with a blank "subject" line. This will retrieve the entry for the ferredoxin example. The following output will be mailed back to you.

BLOCK BL00198: 4FE4S_FERREDOXIN
4Fe-4S ferredoxins, iron-sulfur binding region signature.

TABLE OF CONTENTS

  1. Block introduction
  2. Block number BL00198
  3. Prosite data file
  4. Prosite entry PS00198
  5. Prosite documentation

These PROSITE entries are from the version of PROSITE used to build the BLOCKS database. When PROSITE is updated there might be a discrepancy for about a week's time until the new BLOCKS database is built and these corresponding entries are updated.

The SWISS-PROT entries that are linked in reside at the ExPASy World Wide Web (WWW) Molecular Biology Server of the Geneva University Hospital and the University of Geneva.


Block introduction

Blocks Database Version 9.0, December 1995
Copyright 1991 by Fred Hutchinson Cancer Research Center
1124 Columbia Street, A1-162, Seattle, WA 98104

Please cite: S Henikoff & JG Henikoff (1991) Automated assembly of protein blocks for database searching, Nucleic Acids Res. 19:6565-6572.

Based on PROSITE 13.0 and SWISS-PROT 32.

ID is from PROSITE, AC is derived from the prosite.dat PS#, DE is abstracted from the prosite.dat DE, BL is PROTOMAT information. For each segment, the SWISS-PROT ID is followed by the position of the first residue in the segment. Segments are clustered if >=80% of aligned residues match between any pair of segments. Sequence weights are shown to the right of each segment. The higher the weight (maximum 100) the more dissimilar the segment is from other segments in the block. These weights were obtained using the position-based method of S Henikoff & JG Henikoff (1994), JMB 243:574-578. Pre-computed position-specific scoring matrices were made using pseudo counts from a data-dependent method using Blosum 62 and column totals of five times the number of different amino acids.

========================================================================


Block BL00198 Logo (postscript viewer required)

ID   4FE4S_FERREDOXIN; BLOCK
AC   BL00198; distance from previous block=(7,440)
DE   4Fe-4S ferredoxins, iron-sulfur binding region proteins.
BL   ICP motif; width=12; seqs=112; 99.5%=723; strength=1239
ASRA_SALTY ( 230) CISCGRCTTGCP  18
DCMA_METSO ( 441) CVGCQRCEQTCP  27
DHSB_BACSU ( 154) CMTCGVCLEACP  24
FDHB_METFO ( 296) CLKCYGCREACP  27
FER3_DESAF (  41) CLGCESCVEVCE  20
FER_BACST  (  11) CIACGACGAAAP  22
FER_METTL  (  11) GPECAECVNACP 100
FRDB_WOLSU ( 151) CIECGCCIAACG  39
FRHG_METTH ( 209) CIKCGICYVQCP  24
HMC6_DESVH ( 121) CTCCNRCGQYCP  41
HYCB_ECOLI (  82) CVSCKLCGIACP  17
NAPF_ECOLI (  69) CSFCYACAQACP  30
NAPF_HAEIN (  80) CTFCGKCVDACK  49
NAPH_ECOLI ( 226) CNRCMDCFHVCP  37
NAPH_HAEIN ( 226) CDNCMDCYNVCP  29
PHF1_CLOPA ( 190) CLLCGQCIIACP  15
PHFL_DESVH (  66) CINCGQCLTHCP  25
YAAT_ECOLI (  63) CLECGTCRILGL  98
YJES_ECOLI ( 193) CGKCVACMTICP  38
YJJW_ECOLI (  47) CNDCGECVPQCP  18
ASRC_SALTY ( 212) CIGCGECVLACP  16
COOF_RHORU (  96) CIGCKLCVMVCP  11
DMSB_ECOLI (  98) CIGCRYCHMACP  16
DMSB_HAEIN (  99) CIGCRYCHMACP  16
FDHB_WOLSU (  91) CIGCGYCLYACP  18
FDNH_ECOLI ( 133) CIGCGYCIAGCP  11
FDOH_ECOLI ( 133) CIGCGYCIAGCP  11
FDXH_HAEIN ( 139) CIGCGYCIAGCP  11
FDXN_RHILT (  10) CTQCGACEFECP  17
FER1_AZOVI (  39) CIDCALCEPECP  10
FER1_CHLLI (   8) CTYCGACEPECP  10
FER1_RHOCA (   9) CTSCGDCEPVCP  12
FER2_CHLLI (   8) CTYCAACEPECP  11
FER2_RHOCA (  39) CIDCGVCEPECP   9
FER3_ANAVA (  75) CIGCQACARACP  10
FER3_PLEBO (  75) CIGCEACSRVCP   8
FER3_RHOCA (  79) CIGCGACARVCP   7
FERN_AZOVI (   9) CVNCWACVDVCP  19
FERN_BRAJA (  10) CTSCSACEPLCP  22
FERN_RHIME (  10) CTQCGACEFECP  17
FERV_AZOVI (  10) CTVCGDCEPVCP  20
FERX_ANASP (   9) CISCKLCSSVCP  13
FER_ALIAC  (  39) CIDCAACEPVCP   8
FER_BUTME  (   8) CIACGSCADQCP  15
FER_CHLLT  (   8) CTYCGACEPECP  10
FER_CHRVI  (   8) CINCNVCQPECP  15
FER_CLOTH  (  10) CIACGTCIDLCP  12
FER_ENTHI  (  41) CIGCGACVDACP   7
FER_MEGEL  (  36) CIDCGACEAVCP   7
FER_MYCSM  (  39) CVDCGACEPVCP   8
FER_PEPAS  (  35) CIDCGSCASVCP  11
FER_PSEPU  (  39) CIDCALCEPECP  10
FER_SACER  (  39) CVDCGACEPVCP   8
FER_STRGR  (  39) CVDCGACEPVCP   8
FER_SULAC  (  83) CIFCMACVNVCP  11
FER_THEAC  ( 123) CIFCMACESVCP  12
FIXG_RHIME ( 280) CVDCNACVAVCP  11
FRDB_ECOLI ( 148) CINCGLCYAACP  10
FRDB_HAEIN ( 160) CINCGLCYAACP  10
FRDB_PROVU ( 149) CINCGLCYAACP  10
GLPC_ECOLI (   9) CIKCTVCTTACP  13
GLPC_HAEIN (  32) CIKCTACTAVCP  12
HMC2_DESVH ( 142) CVGCRYCMVACP  14
HYCF_ECOLI (  40) CIGCAACVNACP   8
HYDN_ECOLI (  89) CIGCKTCVVACP  10
MBHT_ECOLI ( 145) CTGCRYCMVACP  14
NAPG_ECOLI (  61) CVRCGQCVQACP  12
NAPG_HAEIN (  72) CIRCGQCVQACP  11
NQO9_PARDE ( 103) CIYCGFCQEACP  15
NRFC_ECOLI ( 125) CVGCQYCIAACP  11
NRFC_HAEIN ( 127) CIGCQYCIAVCP  10
NUIC_MAIZE (  64) CIACEVCVRVCP   8
NUIC_MARPO (  64) CIACEVCVRVCP   8
NUIC_ORYSA (  62) CIACEVCVRVCP   8
NUIC_PLEBO (  64) CIACEVCVRVCP   8
NUIC_SYNY3 (  65) CIACEVCVRVCP   8
NUIC_TOBAC (  64) CIACEVCVRVCP   8
NUIC_WHEAT (  66) CIACEVCVGVCP  16
NUIM_BOVIN ( 152) CIYCGFCQEACP  15
NUIM_RHOCA ( 103) CIYCGYCQEACP  11
NUOI_ECOLI (  98) CIFCGLCEEACP   9
PHSB_SALTY (  96) CIGCDYCVAACP  16
PSAC_ANASP (  10) CIGCTQCVRACP   7
PSAC_ANTSP (  10) CIGCTQCVRACP   7
PSAC_CHLRE (  10) CIGCTQCVRACP   7
PSAC_CYAPA (  10) CIGCTQCVRACP   7
PSAC_EUGGR (  10) CIGCTQCVRACP   7
PSAC_FREDI (  10) CIGCTQCVRACP   7
PSAC_MAIZE (  10) CIGCTHCVRACP  15
PSAC_MARPO (  10) CIGCTQCVRACP   7
PSAC_PEA   (  10) CIGCTQCVRACP   7
PSAC_PINTH (  10) CIGCTQCVRACP   7
PSAC_SPIOL (  10) CIGCTQCVRACP   7
PSAC_SYNEN (  10) CIGCTQCVRACP   7
PSAC_TOBAC (  10) CIGCTQCVRACP   7
PSAC_WHEAT (  10) CIGCTQCVRACP   7
PSAX_SYNY3 (  10) CIGCTQCVRACP   7
PSRB_WOLSU (  93) CVGCLYCIAACP  18
RDXA_RHOSH ( 251) CIDCMACVNVCP  10
YA43_HAEIN (  53) CNGCGECASACP  16
YFFE_ECOLI (  82) CIGCKLCAVVCP  11
DHSB_DROME ( 195) CILCACCSTSCP  14
DHSB_ECOLI ( 149) CILCACCSTSCP  14
DHSB_HUMAN ( 186) CILCACCSTSCP  14
DHSB_USTMA ( 195) CILCACCSTSCP  14
DHSB_YEAST ( 179) CILCACCSTSCP  14
FER1_DESAF (  11) CIACESCVEIAP  24
FER2_DESVM (  11) CMACESCVELCP  15
FER_DESGI  (   8) CMACEACVEICP  14
FIXX_AZOCA (  65) CVECGTCRVIAE  34
FIXX_BRAJA (  66) CIECGTCRVIAE  33
FIXX_RHILP (  67) CMECGTCRVLCE  24
//




Prosite data file

1 blocks processed

CC   *************************************************************************
CC
CC                         *************************
CC                         *** PROSITE data file ***
CC                         *************************
CC
CC   Release 13.0 of November 1995
CC
CC   *************************************************************************
CC
CC   The patterns section of PROSITE is developed by:
CC
CC   Amos Bairoch
CC   Medical Biochemistry Department
CC   CMU
CC   University of Geneva
CC   1, Rue Michel Servet, 1211 Geneva 4
CC   Switzerland
CC
CC   Email    : bairoch@cmu.unige.ch
CC   Telephone: (+41 22) 784 40 82
CC
CC
CC   The profiles/matrices section of PROSITE is developed by:
CC
CC   Philipp Bucher and Kay Oliver Hofmann
CC   Biocomputing ISREC
CC   Institut Suisse de Recherches Experimentales sur le Cancer
CC   155 ch. des Boveresses, 1066 Epalinges s/Lausanne
CC   Switzerland
CC
CC   Email    : pbucher@isrec-sun1.unil.ch
CC              khofmann@isrec-sun1.unil.ch
CC   Telephone: (+41 21) 624 99 43
CC
CC   *************************************************************************
CC
CC   This file may be copied and redistributed freely, without advance
CC   permission. You are allowed to reformat it for use with a software
CC   package, but you should not modify its content without permission
CC   from the author).
CC
CC   *************************************************************************
//


Prosite entry PS00198

ID   4FE4S_FERREDOXIN; PATTERN.
AC   PS00198;
DT   APR-1990 (CREATED); APR-1990 (DATA UPDATE); NOV-1995 (INFO UPDATE).
DE   4Fe-4S ferredoxins, iron-sulfur binding region signature.
PA   C-x(2)-C-x(2)-C-x(3)-C-[PEG].
NR   /RELEASE=32,49340;
NR   /TOTAL=231(158); /POSITIVE=200(140); /UNKNOWN=4(2); /FALSE_POS=27(16);
NR   /FALSE_NEG=7; /PARTIAL=0;
CC   /TAXO-RANGE=A?EP?; /MAX-REPEAT=6;
CC   /SITE=1,iron_sulfur; /SITE=3,iron_sulfur; /SITE=5,iron_sulfur;
CC   /SITE=7,iron_sulfur;
DR   P00214, FER1_AZOVI, T; P18082, FER2_RHOCA, T; P80448, FER2_RHORU, T;
DR   P00215, FER_MYCSM , T; P24496, FER_SACER , T; P13279, FER_STRGR , T;
DR   P00213, FER_PSEPU , T; P08811, FER_PSEST , T; P03942, FER_THETH , T;
DR   P00198, FER_CLOAC , T; P00196, FER_CLOBU , T; P00195, FER_CLOPA , T;
DR   P22846, FER_CLOPE , T; P00197, FER_CLOSP , T; P80168, FER_CLOST , T;
DR   P07508, FER_CLOTM , T; P00200, FER_CLOTS , T; P00201, FER_MEGEL , T;
DR   P00193, FER_PEPAS , T; P00194, FER1_RHORU, T; P14073, FER_BUTME , T;
DR   P00205, FER_CHLLT , T; P00204, FER1_CHLLI, T; P00206, FER2_CHLLI, T;
DR   P00208, FER_CHRVI , T; P00202, FER_METBA , T; P21305, FER_METTL , T;
DR   P00218, FER_THEAC , T; P00211, FER2_DESDN, T; P08812, FER3_DESAF, T;
DR   P08813, FER1_DESVM, T; P11425, FER_ENTHI , T; P12415, FERX_ANASP, T;
DR   P06123, FERN_AZOCH, T; P14939, FERV_AZOVI, T; P42711, FDXN_RHILT, T;
DR   P12712, FERN_RHIME, T; P27394, FERN_BRAJA, T; P16021, FER1_RHOCA, T;
DR   P03941, FER_ALIAC , T; P00219, FER_SULAC , T; P00207, FER1_RHOPA, T;
DR   P11054, FERN_AZOVI, T; P46050, FER3_ANAVA, T; P46036, FER3_PLEBO, T;
DR   P20624, FER3_RHOCA, T; P00203, FER_CLOTH , T; P00209, FER_DESGI , T;
DR   P07485, FER1_DESDN, T; P10624, FER2_DESVM, T; P29604, FER_THELI , T;
DR   P46797, FER_THEMA , T; Q05561, FIXX_RHILP, T; P09822, FIXX_RHIME, T;
DR   P08710, FIXX_RHILT, T; Q06439, PSAC_ANTSP, T; Q00914, PSAC_CHLRE, T;
DR   P42046, PSAC_CUCSA, T; P31556, PSAC_EUGGR, T; P11601, PSAC_MAIZE, T;
DR   P06251, PSAC_MARPO, T; P10793, PSAC_PEA  , T; P41649, PSAC_PINTH, T;
DR   P10098, PSAC_SPIOL, T; P07136, PSAC_TOBAC, T; P10794, PSAC_WHEAT, T;
DR   P31173, PSAC_CYAPA, T; P23392, PSAC_ANASP, T; P31086, PSAC_ANAVA, T;
DR   P23810, PSAC_FREDI, T; P18083, PSAC_SYNEN, T; P31087, PSAC_SYNP2, T;
DR   P31085, PSAC_SYNP6, T; P25252, PSAC_SYNY3, T; P32422, PSAX_SYNY3, T;
DR   P08066, DHSB_BACSU, T; P07014, DHSB_ECOLI, T; P00364, FRDB_ECOLI, T;
DR   P44893, FRDB_HAEIN, T; P20921, FRDB_PROVU, T; P20925, YFRA_PROVU, T;
DR   P17596, FRDB_WOLSU, T; P06130, FDHB_METFO, T; P19498, FRHG_METTH, T;
DR   P18396, FIXG_RHIME, T; Q01854, RDXA_RHOSH, T; P07598, PHFL_DESVH, T;
DR   P13629, PHFL_DESVO, T; P31894, COOF_RHORU, T; P18776, DMSB_ECOLI, T;
DR   P45003, DMSB_HAEIN, T; P23481, YFFE_ECOLI, T; P24184, FDNH_ECOLI, T;
DR   P32175, FDOH_ECOLI, T; P44450, FDXH_HAEIN, T; P27273, FDHB_WOLSU, T;
DR   P33389, HMC2_DESVH, T; P33393, HMC6_DESVH, T; P26474, ASRA_SALTY, T;
DR   P13034, GLPC_ECOLI, T; P43801, GLPC_HAEIN, T; P16428, HYCB_ECOLI, T;
DR   P16432, HYCF_ECOLI, T; P30132, HYDN_ECOLI, T; P37601, PHSB_SALTY, T;
DR   P31076, PSRB_WOLSU, T; P32708, NRFC_ECOLI, T; P45015, NRFC_HAEIN, T;
DR   P33939, NAPF_ECOLI, T; P44650, NAPF_HAEIN, T; P33936, NAPG_ECOLI, T;
DR   P44652, NAPG_HAEIN, T; P33934, NAPH_ECOLI, T; P44653, NAPH_HAEIN, T;
DR   P32815, YGL5_BACST, T; P39288, YJES_ECOLI, T; P44101, YA43_HAEIN, T;
DR   P32420, DHSB_USTMA, T; P21801, DHSB_YEAST, T; P21911, DHSB_SCHPO, T;
DR   P21912, DHSB_HUMAN, T; P21913, DHSB_RAT  , T; P21914, DHSB_DROME, T;
DR   P21915, DHSB_ARATH, T; P37179, MBHT_ECOLI, T; P29166, PHF1_CLOPA, T;
DR   P26476, ASRC_SALTY, T; P46722, NUIC_MAIZE, T; P06253, NUIC_MARPO, T;
DR   P12099, NUIC_ORYSA, T; P06252, NUIC_TOBAC, T; P05312, NUIC_WHEAT, T;
DR   Q00236, NUIC_PLEBO, T; P26525, NUIC_SYNY3, T; P42028, NUIM_BOVIN, T;
DR   P42031, NUIM_RHOCA, T; P29921, NQO9_PARDE, T; P33604, NUOI_ECOLI, T;
DR   P26692, DCMA_METSO, T; P39409, YJJW_ECOLI, T;
DR   Q06879, NIFJ_ANASP, ?; P03833, NIFJ_KLEPN, ?;
DR   P00212, FER_BACST , N; P10245, FER_BACTH , N; P00210, FER1_DESAF, N;
DR   P26485, FIXX_AZOCA, N; P10326, FIXX_BRAJA, N; P09823, FIXX_RHILE, N;
DR   P31576, YAAT_ECOLI, N;
DR   P05687, CHH2_BOMMO, F; P20730, CHHC_BOMMO, F; P30826, ISP1_TRYBB, F;
DR   P05107, ITB2_HUMAN, F; P26010, ITB7_HUMAN, F; P26011, ITB7_MOUSE, F;
DR   P26372, KRUC_SHEEP, F; Q01642, M84A_DROME, F; Q01643, M84B_DROME, F;
DR   Q01644, M84C_DROME, F; Q01645, M84D_DROME, F; P08175, M87F_DROME, F;
DR   P23327, SRCH_HUMAN, F; P16230, SRCH_RABIT, F; P37127, YFFG_ECOLI, F;
DR   P45866, YWJF_BACSU, F;
3D   5FD1; 1FD2; 2FD2; 1FDA; 1FDB; 1FDC; 1FDD; 1FDX; 1FER; 2FXB; 1FXD;
DO   PDOC00176;




Prosite documentation

The following lines are also links to the prosite entries at the ExPASy World Wide Web (WWW) molecular biology server of the Geneva University Hospital and the University of Geneva.

{PDOC00176}
{PS00198; 4FE4S_FERREDOXIN}
{BEGIN}
************************************************************
* 4Fe-4S ferredoxins, iron-sulfur binding region signature *
************************************************************

Ferredoxins  [1]  are a group  of  iron-sulfur proteins which mediate electron
transfer in  a  wide variety  of   metabolic  reactions.   Ferredoxins  can be
divided into several subgroups  depending upon the physiological nature of the
iron-sulfur  cluster(s).   One of these  subgroups are the 4Fe-4S ferredoxins,
which  are  found  in   bacteria  and  which   are thus  often    referred  as
'bacterial-type' ferredoxins.  The structure of these proteins [2] consists of
the duplication of a  domain of twenty six amino  acid residues; each of these
domains contains four cysteine residues that bind to a 4Fe-4S center.

A number  of  proteins  have  been found [3]  that  include one or more 4Fe-4S
binding domains similar to those of bacterial-type ferredoxins. These proteins
are  listed  below  (references  are  only  provided  for  recently determined
sequences).

 - The iron-sulfur proteins  of  the succinate  dehydrogenase and the fumarate
   reductase  complexes (EC 1.3.99.1).  These  enzyme   complexes,  which  are
   components of the tricarboxylic  acid cycle, each contain three subunits: a
   flavoprotein,  an iron-sulfur protein,  and a b-type cytochrome.  The iron-
   sulfur proteins contain three  different  iron-sulfur  centers: a 2Fe-2S, a
   3Fe-3S  and a 4Fe-4S.
 - Escherichia coli anaerobic glycerol-3-phosphate dehydrogenase (EC 1.1.99.5)
   This enzyme is composed of three subunits: A, B, and C. The C subunit seems
   to be an  iron-sulfur  protein  with  two ferredoxin-like domains in the N-
   terminal part of the  protein.
 - Escherichia coli anaerobic dimethyl sulfoxide reductase.  The  B subunit of
   this  enzyme  (gene dmsB)  is  an  iron-sulfur  protein  with  four  4Fe-4S
   ferredoxin-like domains.
 - Escherichia coli  formate  hydrogenlyase.  Two   of  the  subunits  of this
   oligomeric complex (genes hycB and hycF) seem  to  be  iron-sulfur proteins
   that each contain two 4Fe-4S ferredoxin-like domains.
 - Methanobacterium formicicum formate dehydrogenase (EC 1.2.1.2). This enzyme
   is used by the archaebacteria  to grow on formate.  The  beta chain of this
   dimeric enzyme probably binds two 4Fe-4S centers.
 - Escherichia  coli  formate  dehydrogenases  N  and O (EC 1.2.1.2). The beta
   chain  of these two enzymes (genes fdnH and fdoH) are  iron-sulfur proteins
   with four 4Fe-4S ferredoxin-like domains.
 - Desulfovibrio periplasmic [Fe] hydrogenase (EC 1.18.99.1).  The large chain
   of this dimeric enzyme binds three 4Fe-4S centers, two of which are located
   in the ferredoxin-like N-terminal region of the protein.
 - Methanobacterium  thermoautrophicum  methyl  viologen-reducing  hydrogenase
   subunit mvhB, which contains six  tandemly repeated ferredoxin-like domains
   and which probably binds twelve 4Fe-4S centers.
 - Salmonella typhimurium anaerobic sulfite reductase (EC 1.8.1.-) [4]. Two of
   the subunits of  this enzyme (genes asrA and asrC) seem  to  both  bind two
   4Fe-4S centers.
 - A Ferredoxin-like protein  (gene fixX)  from  the  nitrogen-fixation  genes
   locus of various  Rhizobium  species,  and   one  from  the  Nif-region  of
   Azotobacter species.
 - The 9 Kd  polypeptide  of chloroplast photosystem I [5]  (gene psaC).  This
   protein contains two low potential 4Fe-4S centers, referred as  the A and B
   centers.
 - The chloroplast frxB protein which is predicted to carry two 4Fe-4S centers.
 - An ferredoxin  from a  primitive  eukaryote,  the enteric amoeba  Entamobea
   histolytica.
 - Escherichia  coli  hypothetical  protein  yjjW, a protein with a N-terminal
   region belonging to the radical activating enzymes family (see <PDOC00834>)
   and two potential 4Fe-4S centers.

The pattern of cysteine  residues in the  iron-sulfur region  is sufficient to
detect this class of 4Fe-4S binding proteins.

-Consensus pattern: C-x(2)-C-x(2)-C-x(3)-C-[PEG]
                    [The four C's are 4Fe-4S ligands]
-Sequences known to belong to this class detected by the pattern: the majority
 of known 4Fe-4S sequences, with at least 5 exceptions.
-Other sequence(s) detected in SWISS-PROT: 14.

-Note: in some  bacterial  ferredoxins,  one of the two duplicated domains has
 lost one or  more of  the four conserved  cysteines.  The consequence of such
 variations is that these domains  have  either lost their iron-sulfur binding
 property or bind to a 3Fe-3S center instead of a 4Fe-4S center.
-Note: the last residue of this  pattern in most  proteins  belonging  to this
 group,  is a Pro; the  only  exceptions  are  the  Rhizobium  ferredoxin-like
 proteins which have Gly, and two Desulfovibrio ferredoxins which have Glu. It
 must also be noted that  the  three  non  4Fe-4S-binding  proteins  which are
 picked-up by the pattern have Gly in this position of the pattern.

-Last update: November 1995 / Text revised.

[ 1] Meyer J.
     Trends Ecol. Evol. 3:222-226(1988).
[ 2] Otaka E., Ooi T.
     J. Mol. Evol. 26:257-267(1987).
[ 3] Beinert H.
     FASEB J. 4:2483-2492(1990).
[ 4] Huang C.J., Barrett E.L.
     J. Bacteriol. 173:1544-1553(1991).
[ 5] Knaff D.B.
     Trends Biochem. Sci. 13:460-461(1988).
//


Blocks home


This is probably more information about ferredoxin than you would ever want. The logo link in the output above even gives a graphical view of the protein block. But if you do want more, there are references, and some entries also list a contact person who is an expert on the family in question, sometimes together with that person's e-mail address.

The actual output consists of two parts: one from the Hutchinson center and the other a copy of the PROSITE entry. The output begins with a note about blocks in general, followed by the Block entry BL00198 together with all the database entries in which this block occurs and a listing of the other known conserved ferredoxin blocks. In this case you can see that it ranges from 5 to 438 residues into the protein (i.e. anywhere). Then comes a general statement regarding PROSITE, followed by the PROSITE entry itself, which carries a great deal of information. The pattern for these iron-sulfur binding region signatures is C-x(2)-C-x(2)-C-x(3)-C-[PEG] (that is, CXXCXXCXXXC[PEG]), where [PEG] stands for either proline, glutamic acid or glycine. The other notation that you are likely to run across is curly braces, as in {AG}, which means any residue except alanine or glycine. Besides all of the other information, the listing also tells you how well this pattern detects iron-sulfur binding proteins. In this case the pattern gives a total of 231 matches in 158 sequences. Of these, 200 matches (in 140 sequences) are true positives, 27 matches (in 16 sequences) are false positives and 4 matches (in 2 sequences) are of uncertain status; a further 7 sequences known to belong to the family are missed (false negatives). Last comes a general description of the ferredoxin pattern in proteins and a list of references.
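To make the pattern notation concrete, here is a minimal sketch in Python of how a PROSITE-style pattern could be translated into a regular expression and tried against segments from the block listing above. The function name prosite_to_regex is made up for this illustration, and the sketch only handles the elements discussed here (single residues, x, [..] sets, {..} exclusions and repeat counts); it is not how the PROSITE or Blocks servers are implemented.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regular expression.

    Hypothetical helper for illustration only; anchors and other
    special PROSITE syntax are not handled.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split off an optional repeat count such as (2) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, lo, hi = m.groups()
        if core == "x":
            core = "."                      # x means any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # {AG} means any residue except A or G
        # [PEG] and single letters are already valid regex syntax.
        if lo and hi:
            core += "{%s,%s}" % (lo, hi)
        elif lo:
            core += "{%s}" % lo
        parts.append(core)
    return "".join(parts)

ferredoxin = prosite_to_regex("C-x(2)-C-x(2)-C-x(3)-C-[PEG].")

# Segments copied from the block listing: PSAC_TOBAC and FER_BACST.
for name, segment in [("PSAC_TOBAC", "CIGCTQCVRACP"), ("FER_BACST", "CIACGACGAAAP")]:
    found = re.search(ferredoxin, segment) is not None
    print(name, segment, "match" if found else "no match")
```

The FER_BACST segment has an alanine where the fourth cysteine is expected, so the pattern misses it; this agrees with the PROSITE entry, which flags FER_BACST with N (false negative).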

A really great resource - this was part of Amos Bairoch's Ph.D. thesis.


Quick Search

Often you are simply interested in finding out whether what you have sequenced is already known (a more and more common occurrence), rather than whether there is anything homologous to it. This question can be answered much more quickly, and EMBL has implemented the QUICKSEARCH server to do so. The program will detect hits even if there are a small number of mismatches.

The input message should be of the form ...

MATCH 90
BEST
TITLE This is an example
SEQ
                AAACCATATAGGCCCTTTT

The sequence must be a nucleotide sequence. Only SEQ is required; all other options take their default values. The MATCH n option says that only entries with more than n% identity should be reported (default: 90). The BEST option says that a Smith-Waterman alignment should be done rather than a Needleman-Wunsch alignment.
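As a small illustration of the message format described above, the following Python sketch composes the body of such a request. The function name quicksearch_message is hypothetical, and the set of accepted nucleotide letters (A, C, G, T, U, N) is an assumption made for this sketch; the server's own rules may differ. Mailing the result is left out on purpose.

```python
def quicksearch_message(sequence, match=None, best=False, title=None):
    """Compose the body of a QUICKSEARCH request (hypothetical helper).

    Only SEQ is mandatory; MATCH, BEST and TITLE are written only when given.
    """
    seq = "".join(sequence.split()).upper()
    # Assumed alphabet for this sketch; the real server may accept other codes.
    if not seq or set(seq) - set("ACGTUN"):
        raise ValueError("QUICKSEARCH expects a nucleotide sequence")
    lines = []
    if match is not None:
        lines.append("MATCH %d" % match)
    if best:
        lines.append("BEST")
    if title:
        lines.append("TITLE " + title)
    lines.append("SEQ")
    lines.append(seq)
    return "\n".join(lines) + "\n"

print(quicksearch_message("AAACCATATAGGCCCTTTT", match=90, best=True,
                          title="This is an example"))
```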

The QUICKSEARCH method is very similar to the FASTA approach but uses a very large WORD size for the "hash" table, which accounts for the difference in speed. Again, this makes it appropriate for asking whether very similar sequences are already in the database; it cannot tell you whether there are distantly related sequences. The original program was part of J. Devereux's Ph.D. thesis.
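The idea of a word "hash" table can be sketched in a few lines of Python. This is only an illustration of the general technique, not the QUICKSEARCH or FASTA code; the function names, the toy database and the word sizes (12 and 4) are all assumptions chosen for the example.

```python
from collections import defaultdict

def build_word_index(database, word_size):
    """Map every word of length word_size to the (entry, offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for i in range(len(seq) - word_size + 1):
            index[seq[i:i + word_size]].append((name, i))
    return index

def count_word_hits(query, index, word_size):
    """Count, for each database entry, how many query words hit the index."""
    hits = defaultdict(int)
    for i in range(len(query) - word_size + 1):
        for name, _offset in index.get(query[i:i + word_size], ()):
            hits[name] += 1
    return dict(hits)

# Toy database: one entry containing the query, one unrelated entry.
database = {
    "exact": "GGAAACCATATAGGCCCTTTTGG",
    "other": "TTTTTTTTTTTTTTTTTTTTTTT",
}
query = "AAACCATATAGGCCCTTTT"

for k in (12, 4):
    index = build_word_index(database, k)
    print(k, count_word_hits(query, index, k))
```

With the long word only the entry that really contains the query is reported, while the short word also picks up chance hits in the unrelated entry. Long words therefore mean far fewer candidates to examine, which is where the speed comes from, at the price of missing anything that is not nearly identical.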

The input file should be mailed to quick@ebi.ac.uk.

SSearch

At the opposite extreme is SSEARCH. This does a rigorous sequence comparison using the Smith-Waterman algorithm (T. F. Smith and M. S. Waterman (1981) J. Mol. Biol. 147:195-197). This program uses code developed by Huang and Miller (X. Huang, R. C. Hardison, W. Miller (1990) CABIOS 6:373-381) for calculating the local similarity score and code from the ALIGN program (see below) for calculating the local alignment. SSEARCH is about 50 times slower than FASTA with ktup=2 (for proteins).
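For readers who have not met the Smith-Waterman algorithm, here is a minimal Python sketch of the score calculation. It is not the SSEARCH code: it uses a simple match/mismatch score and a linear gap penalty, and the values (+2, -1, -2) are assumptions chosen for illustration, whereas a real search program would use a scoring matrix and its own gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Illustrative scoring only: fixed match/mismatch values and a linear gap penalty.
    """
    prev = [0] * (len(b) + 1)   # scores for the previous row of the matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero: this is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

Every cell of the matrix has to be filled in for every database sequence, with no word lookup to skip the unpromising ones, which is why this approach is so much slower than FASTA.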


Why you should routinely check your sequence

The following is an example of why you should routinely do a search (FASTA, BLAST or whatever) for any new sequence that you are working on. This is a copy of a letter to the editor of NATURE vol. 358, p.271.


Fact and fiction in alignment.

Sir - We have discovered a startling similarity between a dinosaur DNA sequence reported in the novel Jurassic Park [1] and a partial human brain cDNA sequence from the Venter laboratory described in Nature [2] (see figure).

HUMXT 317 GCGTTGCTGGCGTTTTTCCATAGGCTCCGACCCCCTGACGAGCATCACAAAAATCGACGCTCAA
          ***************************** ******************************
DINO1   1 GCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAAAATCGACGC----
          *************************************************** *
DINO1 670 GCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAACAAGTCAGA----

HUMXT 234 GTCANAGGTGGCGGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTTGGAGCTTCC
                ******* **************************************** * **** **
DINO1  61 ------GGTGGCG-AAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCTCCC
                ******* ********************************************** * *
DINO1 730 ------GGTGGCG-AAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCGCTC

The dinosaur sequence (DINO1) consists of duplication, with 117 base pairs from the first member of the repeat aligning with the human sequence, HUMXT01431, at the 95 per cent level of identity with only two gaps. The extraordinary degree of nucleotide sequence conservation between organisms as distantly related as dinosaur and human suggests strongly conserved function. Expression of HUMXT01431 in human brain raises the possibility that the dinosaurs were smarter than has been supposed, arguing against the hypothesis that their extinction resulted from lack of intelligence. Our discovery also seems to raise the interesting legal question as to whether the copyright on Jurassic Park takes precedence over the pending patent on the human sequence. However, it appears that neither group is entitled to legal protection for its sequence, because both sequences also align with cloning vector pBR322, raising the possibility that both groups inadvertently sequenced vector DNA.

Alan C. Christensen, Dept of Biochemistry and Molecular Biology, Thomas Jefferson University, Philadelphia, Pennsylvania, 19107 USA.
Steven Henikoff, Howard Hughes Medical Institute and Basic Sciences Division, Fred Hutchinson Cancer Research Center, Seattle Washington 98104 USA.

[1] Crichton, M. Jurassic Park, 102 (Ballantine, New York 1990).
[2] Adams, M.D. et al., Nature 355, 632-634 (1992).

With such good jokers in the world as these gentlemen are, you don't want to get caught by them.
