The BLOSUM matrices originate with a paper by Henikoff and Henikoff (1992; PNAS 89:10915-10919). Their idea was to obtain a better measure of the differences between two proteins, specifically for more distantly related proteins. While this bias limits the usefulness of BLOSUM matrices for some purposes, for programs such as FASTA and BLAST they should do substantially better. This is because an accurate measure of distance matters less when peptides are closely related.
They use the BLOCKS database to search for differences among sequences
but only among the very conserved regions of a protein family. Hence
the term BLOSUM, from BLOcks SUbstitution Matrix. They first collect
all of the blocks in the BLOCKS database and then, for each one, they
count the amino acid pairs in each column to get a frequency table
($f_{ij}$) of how often different pairs of amino acids are
found together in these conserved regions. Hence the observed
frequency of occurrence of a pair of amino acids is

$$q_{ij} = \frac{f_{ij}}{\sum_{i=1}^{20} \sum_{j=1}^{i} f_{ij}}$$

and the observed frequency of occurrence of one amino acid is

$$p_i = q_{ii} + \sum_{j \neq i} \frac{q_{ij}}{2}$$

Given pairs should occur by chance with frequencies

$$e_{ii} = p_i^2$$

and

$$e_{ij} = 2 p_i p_j \quad (i \neq j)$$

The odds matrix is $q_{ij}/e_{ij}$. Generally $\log$'s are taken of
this matrix to give a log odds or lod matrix such that

$$s_{ij} = \log_2 \left( \frac{q_{ij}}{e_{ij}} \right)$$

Hence if the observed frequency of a pair of amino acids is equal to
the expected frequency then $s_{ij} = 0$. If the observed is less than
expected then $s_{ij} < 0$ and if the observed is greater than
expected $s_{ij} > 0$.
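The calculation can be illustrated with a small sketch. It is not the authors' program, only a minimal example under stated assumptions: a block is a list of equal-length, ungapped strings; `pair_counts` and `log_odds` are hypothetical helper names; f holds the pair counts, q the observed pair frequencies, p the single amino acid frequencies, and s the log odds scores with expected pair frequencies p_i^2 and 2 p_i p_j. The toy block is a single column of nine A's and one S.

```python
import math
from collections import Counter
from itertools import combinations


def pair_counts(block):
    """Count unordered amino acid pairs, column by column, in an ungapped block."""
    f = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            f[tuple(sorted((a, b)))] += 1
    return f


def log_odds(f):
    """Return (q, p, s): pair frequencies, residue frequencies, log2 odds scores."""
    total = sum(f.values())
    q = {pair: n / total for pair, n in f.items()}

    # Each residue gets all of q_ii and half of every q_ij it takes part in.
    p = Counter()
    for (a, b), q_ab in q.items():
        if a == b:
            p[a] += q_ab
        else:
            p[a] += q_ab / 2
            p[b] += q_ab / 2

    # Expected pair frequency: p_i^2 on the diagonal, 2 p_i p_j off it.
    s = {}
    for (a, b), q_ab in q.items():
        e_ab = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        s[(a, b)] = math.log2(q_ab / e_ab)
    return q, dict(p), s


# Toy block (an assumption for illustration): one column, nine A's and one S.
block = ["A"] * 9 + ["S"]
f = pair_counts(block)      # 36 A-A pairs and 9 A-S pairs
q, p, s = log_odds(f)
for pair, score in sorted(s.items()):
    print(pair, round(score, 3))
```

In this toy column the A-S pair is seen slightly more often than expected by chance, so its score is positive, while the A-A score is slightly negative. Scores are only obtained for pairs that were actually observed; a real matrix is built from counts pooled over many blocks. The published matrices report the scores multiplied by 2 and rounded to integers (half-bit units).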
All of this gives the BLOSUM matrix. Different levels of the BLOSUM matrix can be created by differentially weighting sequences according to their degree of similarity. For example, a BLOSUM62 matrix is calculated from protein blocks such that if two sequences are at least 62% identical, they are clustered together and their contributions are weighted to sum to one, as if they were a single sequence. In this way the contributions of multiple entries of closely related sequences are reduced.
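The weighting step can be sketched in the same spirit. This is a simplified illustration, not the original implementation, and it rests on several assumptions: sequences in a block are ungapped and of equal length, clusters are formed by single linkage (a sequence joins a cluster if it meets the identity threshold with any member), each sequence in a cluster of size n carries weight 1/n, and pairs within a cluster are not counted. The function names and the toy block are hypothetical.

```python
from collections import defaultdict


def identity(a, b):
    """Fraction of aligned positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster(block, threshold):
    """Single-linkage cluster labels using a small union-find structure."""
    parent = list(range(len(block)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(block)):
        for j in range(i + 1, len(block)):
            # Assumption: ">=" is used as the clustering criterion.
            if identity(block[i], block[j]) >= threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(block))]


def weighted_pair_counts(block, threshold=0.62):
    """Pair counts in which every cluster contributes as one sequence."""
    labels = cluster(block, threshold)
    size = defaultdict(int)
    for lab in labels:
        size[lab] += 1
    weight = [1 / size[lab] for lab in labels]

    f = defaultdict(float)
    for column in zip(*block):
        for i in range(len(block)):
            for j in range(i + 1, len(block)):
                if labels[i] == labels[j]:
                    continue  # pairs inside one cluster are skipped
                pair = tuple(sorted((column[i], column[j])))
                f[pair] += weight[i] * weight[j]
    return labels, weight, f


# Toy block: three near-identical sequences and one unrelated sequence.
block = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEFGHIRV", "MNPQRSTVWY"]
labels, weight, f = weighted_pair_counts(block)
print(labels, weight, sum(f.values()))
```

Here the first three sequences fall into one cluster and together count as a single sequence, so each column contributes a total weight of one pair rather than the six pairs an unweighted count would give. Raising the threshold clusters fewer sequences and lets closely related pairs contribute more to the counts.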
The BLOSUM62 matrix is given in Table 2. If the BLOSUM62 matrix is compared to PAM160 (its closest equivalent), the BLOSUM matrix is found to be less tolerant of substitutions to or from hydrophilic amino acids, while more tolerant of hydrophobic changes and of cysteine and tryptophan mismatches.
