BLAST (Basic Local Alignment Search Tool) is an algorithm to search large
databases for homologs in a reasonably short amount of time. The program does
this by breaking the query sequence into a number of 'words' (by default,
three residues for proteins and 11 bases for nucleotides).
The database is
then searched for matches to these words, permitting inexact matches. The
program then tries to build on these 'word' hits by extending the
alignment out on either side of the core word, until the cumulative score
can no longer be increased: these extended hits are called HSPs
(high-scoring pairs). Because BLAST finds local (as opposed to global)
alignments, it is able to find relationships among sequences that
have only isolated regions of similarity.
All statistically significant HSPs are then sorted by score, so that the
best matches are presented first.
The significant matches are also aligned to show where the homologous
regions are.
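The word-and-extend idea described above can be sketched in a few lines of Python. This is a simplified illustration, not the BLAST implementation: it accepts exact word matches only (BLAST also permits inexact ones), it scores with a flat match/mismatch value instead of a substitution matrix, and the function names and the `x_drop` cut-off value are assumptions chosen for the example.

```python
def split_into_words(seq, size):
    """Return (offset, word) for every overlapping word of the given size."""
    return [(i, seq[i:i + size]) for i in range(len(seq) - size + 1)]


def _extend(query, subject, q, s, step, x_drop, match, mismatch):
    """Walk in one direction from (q, s) without gaps.

    Returns (best score gain, number of residues kept). The walk stops
    when a sequence ends or the running score falls more than x_drop
    below the best score seen so far.
    """
    gain = best_gain = 0
    kept = walked = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        gain += match if query[q] == subject[s] else mismatch
        walked += 1
        if gain > best_gain:
            best_gain, kept = gain, walked
        elif best_gain - gain > x_drop:
            break
        q += step
        s += step
    return best_gain, kept


def find_hsps(query, subject, size=3, x_drop=2, match=1, mismatch=-1):
    """Find exact word hits and extend each one into an ungapped HSP.

    Returns (score, query_start, subject_start, length) tuples,
    best score first.
    """
    index = {}
    for pos, word in split_into_words(subject, size):
        index.setdefault(word, []).append(pos)

    hsps = set()  # several word hits may extend into the same HSP
    for q_pos, word in split_into_words(query, size):
        for s_pos in index.get(word, []):
            right_gain, right = _extend(query, subject, q_pos + size,
                                        s_pos + size, 1, x_drop, match, mismatch)
            left_gain, left = _extend(query, subject, q_pos - 1,
                                      s_pos - 1, -1, x_drop, match, mismatch)
            score = size * match + left_gain + right_gain
            hsps.add((score, q_pos - left, s_pos - left, left + size + right))
    return sorted(hsps, reverse=True)
```

For example, `find_hsps("ACGTTGCA", "TTACGTTGCATT")` finds six word hits that all extend into the same eight-residue HSP starting at query position 0 and subject position 2.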
The BLAST suite of programs is designed so that it is possible to
use either protein or DNA sequences to search either protein or DNA
databases. The programs available are:
- BLASTN: searches a DNA sequence against a DNA database
- BLASTP: searches a protein sequence against a protein database
- BLASTX: searches a DNA sequence, translated in all six reading frames,
against a protein database
- TBLASTN: searches a protein sequence against a DNA database, translated
in all six reading frames
- TBLASTX: searches a DNA sequence, translated in all six reading frames,
against a DNA database, translated in all six reading frames
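The list above amounts to a small lookup from query type and database type to a program name. The helper below is only an illustrative sketch; the function name, the `"dna"`/`"protein"` labels and the `translated` flag are assumptions made for this example and are not part of BLAST itself.

```python
def choose_blast_program(query_type, database_type, translated=False):
    """Pick a BLAST program from the query and database sequence types.

    query_type and database_type are "dna" or "protein". The translated
    flag only matters for DNA against DNA: it selects TBLASTX (both
    sides translated in six frames) instead of BLASTN.
    """
    key = (query_type.lower(), database_type.lower())
    if key == ("dna", "dna"):
        return "TBLASTX" if translated else "BLASTN"
    programs = {
        ("protein", "protein"): "BLASTP",
        ("dna", "protein"): "BLASTX",
        ("protein", "dna"): "TBLASTN",
    }
    if key not in programs:
        raise ValueError(f"unknown sequence types: {key}")
    return programs[key]
```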
In general, if you have a coding region of a DNA sequence,
it is advisable to translate it into protein before carrying out
a search. Because the genetic code is redundant (several codons encode
the same amino acid), a DNA sequence retains far less information with
which homologies can be detected, after even a short span of
evolutionary time, than the protein sequence it encodes.
Once the DNA sequence has been translated, BLASTP can be used to search a
protein database. If your sequence is non-coding, use FASTA; you should
rarely have to use BLASTN. If you want to do a
preliminary check for frameshift errors in your sequence, use BLASTX to
compare your sequence translated in all six reading frames against a protein
database. If you want to search for a particular protein sequence in an
EST database, use TBLASTN.
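To make "translated in all six reading frames" concrete, here is a minimal Python sketch using the standard genetic code. The function names and the frame labels (`+1` to `+3` for the given strand, `-1` to `-3` for its reverse complement) are conventions chosen for this example; codons containing characters other than A, C, G and T are shown as `X`, which is an assumption of the sketch.

```python
from itertools import product

# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; '*' marks a stop codon."""
    end = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, end, 3))


def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

A frameshift error typically shows up as a protein match that jumps from one of these frames to another part-way through the sequence, which is why BLASTX is useful as a preliminary check.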
The BLAST pages have recently been redesigned for greater ease of use.
They are divided into six sections, of which the first three are
pointers to the usual suite of BLAST programs. The first section deals
with nucleotide against nucleotide searches, i.e. BLASTN, and includes
the MegaBLAST program, which is useful when you want to do large scale
nucleotide-nucleotide analysis, such as searching the whole of dbEST
(the EST, or expressed sequence tag, database) against the whole of
the nucleotide non-redundant database (nr). It also includes a pointer
to a search for short nearly exact matches, which is the same page as
the normal BLASTN server, but the parameters have been pre-set to
optimize the likelihood of finding meaningful matches with short queries,
which are likely to appear in the database numerous times, just by chance.
The second section deals with protein-protein searches. It includes the
BLASTP program, a page with pre-set parameters for searching with short
sequences (as for the BLASTN program), and a link to the
PSI-/PHI-BLAST program, which uses iterative searches with profiles to
increase the program's sensitivity for finding more distantly related
matches.
The third section includes the remaining suite of BLAST programs, which
involve translation from a nucleotide to its protein sequence either
for the query sequence, the database sequences, or both.
Each program has the same look-and-feel as the others, although some
programs have more options than others. Those options common to all programs
include:
- Search sequence: this gives you a text-entry box into which you
can cut and paste your query sequence (or the accession number or GI).
- Subsequence: this allows you to choose a particular region of your
query sequence to search with, instead of the whole sequence.
- Database: this is a pull down menu with a list of the relevant
databases for that program, e.g. protein only databases for BLASTP.
- Limit by Entrez query / organism: this forces the program to
return only matches which conform to your specifications, e.g. only
sequences which were modified after January 2001, or only matches
with a sequence length of greater than 200 residues, and that have
the word 'pancreas' in the entry. The pull-down menu
allows you to limit your matches to a specific organism, or a specific
taxonomic class, e.g. mammalia.
- Filter: the default is to filter the query sequence for low
complexity regions, which may be common in the database, and yield
statistically significant but biologically uninteresting results. The
BLASTN program also allows you to filter for human repeats (although this
is only useful if your query sequence is from human).
- Expect: this is the expectation value (E), the number of hits you would
expect to find by chance. The lower this value, the more stringent the search
becomes, and the less likely you are to get purely chance matches in
your results. When searching for a short, or common, motif, you should
increase the expectation value (to a maximum of 1000.0). If
you are only interested in very close homologs, set the expectation value to
0.0001. The default value is 10.
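- Word size: the length of the 'words' into which the query sequence is
broken (by default, three residues for proteins and 11 bases for nucleotides).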
- Other advanced options include:
  - Cost to open a gap [Integer]
  - Cost to extend a gap [Integer]
  - Expectation value (E) [Real]
  - Word size
  - Number of one-line descriptions (V) [Integer]
  - Number of alignments to show (B) [Integer]
- Formatting options: these detail how the search output should
look. Formatting choices can be made at the outset of the search,
and can be changed later.
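To give a feel for how the Expect option behaves, the sketch below uses the standard relationship between a normalized bit score S' and the expectation value, E = m x n x 2^(-S'), where m is the query length and n is the total length of the database. The function and parameter names are assumptions made for this example, and the lengths BLAST actually uses are adjusted "effective" lengths, so the numbers here are approximate.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E-value for a bit score: E = m * n * 2 ** (-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


def passes_expect_threshold(bit_score, query_length, database_length,
                            expect=10.0):
    """True if a hit would be reported at the given Expect setting."""
    return expect_value(bit_score, query_length, database_length) <= expect
```

Two things follow directly from the formula: each additional bit of score halves E, and the same alignment becomes less significant as the database grows. A short query can only reach a low bit score, so its hits have high E-values, which is why the Expect threshold needs to be raised when searching with short motifs.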
Each program also has its own set of specific options:
- Matrix (and gap costs) (all BLAST programs except BLASTN):
a matrix assigns a score to every possible pair of amino acid residues,
so that matches and mismatches earn different numbers of points. The
choice of matrix may influence the order in which sequence hits are
reported, and sometimes determines whether a hit is reported at all. The
default matrix is BLOSUM62, which is derived from substitutions observed
in blocks of aligned sequences, after sequences sharing 62% or more
identical residues have been clustered together.
Use of this matrix will find most weak protein similarities.
Other matrices, such as BLOSUM90 or PAM30, are based on
sequences that are evolutionarily close; matrices such as BLOSUM30, BLOSUM40
or PAM250 are based on evolutionarily distant sequences. Use the latter
to pick up longer, more distant homologs.
The raw score of an alignment is the sum of the scores for aligning pairs
of residues and the scores for gaps. Increasing the gap costs (gap opening
and gap extension penalties) will result in alignments with fewer inserted
gaps.
- File (MegaBLAST):
allows you to upload a file of sequences, because
MegaBLAST is designed to run quickly on a large number of sequences.
- Return alignment end points only (MegaBLAST):
this returns only the
start and end points of the HSPs found, and not the alignment of the HSP
itself. This makes the search run slightly faster, and cuts down on the
amount of output returned.
- Percent ID (MegaBLAST):
allows you to choose the percent identity
threshold that HSPs must pass to be included in the output. Generally,
this will be set to a high value, so that only exact, or nearly exact matches
are returned.
- Genetic codes (translated BLAST programs):
Specifies which genetic code to use when
translating a nucleotide sequence to amino acid. By default, the
standard code is used.
- CD Search (BLASTP, PSI-/PHI-BLAST):
compares protein sequences to the Conserved Domain Database,
which is a database of functional and structural domains. This can help to
identify domains within a protein sequence.
- Composition based statistics (BLASTP, PSI-/PHI-BLAST):
when calculating E-values, takes into account the amino acid composition
of the individual database sequences involved in reported alignments.
This improves E-value accuracy, thereby reducing the number of false
positive results.
- PSSM (BLASTP, PSI-/PHI-BLAST):
this allows the use of a previously
constructed Position Specific Score Matrix (saved from a PSI-BLAST search).
This gives you a profile of the residues at each position, and can help
in finding more distantly related homologs.
- PHI Pattern (BLASTP, PSI-/PHI-BLAST):
allows the use of a pattern
(generated by PHI-BLAST) to help identify distantly related homologs.
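The raw score described under the Matrix option (the sum of the residue-pair scores and the gap scores) can be illustrated with a short Python sketch. The tiny score table below is a toy stand-in for a full matrix such as BLOSUM62, and the function name is hypothetical. The sketch assumes that a gap of length k costs the opening cost plus k times the extension cost, and that gaps are written as `-` in the aligned strings.

```python
# Toy pair scores for illustration only; a real search uses a full
# 20 x 20 matrix such as BLOSUM62.
TOY_SCORES = {
    ("A", "A"): 4,
    ("L", "L"): 4,
    ("W", "W"): 11,
    ("I", "L"): 2,
    ("A", "W"): -3,
}


def raw_score(aligned_query, aligned_subject, matrix, gap_open, gap_extend):
    """Sum pair scores and gap penalties over two aligned strings."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    gap_side = None  # which sequence the current run of gaps is in
    for a, b in zip(aligned_query, aligned_subject):
        if a == "-" or b == "-":
            side = "query" if a == "-" else "subject"
            if side != gap_side:
                score -= gap_open  # charged once per gap
            score -= gap_extend    # charged for every gap position
            gap_side = side
        else:
            # The matrix is symmetric, so try both orders of the pair.
            score += matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]
            gap_side = None
    return score
```

With an opening cost of 11 and an extension cost of 1, aligning `ALLW` against `AI-W` scores 4 + 2 - 12 + 11 = 5. Lowering the opening cost to 5 raises the same alignment to 11, which shows why lower gap costs let more gapped alignments through and higher gap costs produce alignments with fewer gaps.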