How to use BLAST

BLAST (Basic Local Alignment Search Tool) is an algorithm for searching large databases for homologs in a reasonably short amount of time. The program first breaks the query sequence into a number of short 'words' (by default, three residues for proteins and 11 bases for nucleotides). The database is then searched for matches to these words, permitting inexact matches. The program then tries to build on each word hit by extending the alignment on either side of the core word until the cumulative score can no longer be increased; these extended hits are called HSPs (high-scoring segment pairs). Because BLAST finds local (as opposed to global) alignments, it can detect relationships among sequences that share only isolated regions of similarity. All statistically significant HSPs are sorted by significance, so that the best matches are presented first. The significant matches are also displayed as alignments to show where the homologous regions are.
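
The word-and-extend idea described above can be sketched in a few lines of Python. This is a simplified illustration, not the BLAST implementation: it seeds on exact word matches only (BLAST also accepts inexact word matches), it uses a toy +1/-1 scoring scheme, and the function names and the `drop` cut-off are assumptions made for the example.

```python
def words(seq, w):
    """Return every overlapping word of length w together with its start offset."""
    return [(i, seq[i:i + w]) for i in range(len(seq) - w + 1)]


def find_seeds(query, subject, w):
    """Find exact word matches as (query_pos, subject_pos) pairs."""
    index = {}
    for j, word in words(subject, w):
        index.setdefault(word, []).append(j)
    return [(i, j) for i, word in words(query, w) for j in index.get(word, [])]


def extend(query, subject, qpos, spos, w, match=1, mismatch=-1, drop=2):
    """Extend a seed without gaps in both directions.

    Extension on each side stops when the running score falls more than
    `drop` below the best score seen so far; the best extent is kept.
    Returns (score, query_start, subject_start, length).
    """
    def one_side(step, q, s):
        best = score = 0
        best_len = length = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            score += match if query[q] == subject[s] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > drop:
                break
            q += step
            s += step
        return best, best_len

    right_score, right_len = one_side(+1, qpos + w, spos + w)
    left_score, left_len = one_side(-1, qpos - 1, spos - 1)
    total = w * match + left_score + right_score
    return total, qpos - left_len, spos - left_len, left_len + w + right_len


query = "GATTACAGG"
subject = "CCGATTTCAGGCC"
for qpos, spos in find_seeds(query, subject, 3):
    print((qpos, spos), extend(query, subject, qpos, spos, 3))
```

In this toy example all four seeds extend to the same local alignment, which is why a real implementation merges overlapping hits before reporting them.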

The BLAST suite of programs is designed so that it is possible to use either protein or DNA sequences to search either protein or DNA databases. The programs available are:

  • BLASTN: searches a DNA sequence against a DNA database
  • BLASTP: searches a protein sequence against a protein database
  • BLASTX: searches a DNA sequence, translated in all six reading frames, against a protein database
  • TBLASTN: searches a protein sequence against a DNA database, translated in all six reading frames
  • TBLASTX: searches a DNA sequence, translated in all six reading frames, against a DNA database, translated in all six reading frames
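
The list above amounts to a small lookup table keyed on the type of the query and the type of the database. A minimal sketch follows; the function name `choose_program` and its `translated` flag are hypothetical and exist only for this illustration.

```python
def choose_program(query_type, database_type, translated=False):
    """Pick a BLAST program from the query and database types.

    query_type and database_type are "dna" or "protein". `translated`
    only matters for DNA against DNA, where it selects between a direct
    nucleotide comparison and a six-frame translated comparison.
    """
    table = {
        ("dna", "dna"): "TBLASTX" if translated else "BLASTN",
        ("protein", "protein"): "BLASTP",
        ("dna", "protein"): "BLASTX",
        ("protein", "dna"): "TBLASTN",
    }
    try:
        return table[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            f"unknown combination: {query_type!r} vs {database_type!r}"
        ) from None


print(choose_program("dna", "protein"))                # BLASTX
print(choose_program("protein", "dna"))                # TBLASTN
print(choose_program("dna", "dna", translated=True))   # TBLASTX
```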

In general, if you have a coding region of a DNA sequence, it is advisable to translate it into protein before carrying out a search. Because the genetic code is redundant (several codons encode the same amino acid), a DNA sequence diverges faster than the protein it encodes, so even after a very short space of evolutionary time it contains far less information for detecting homology than the encoded protein sequence does. Once the DNA sequence has been translated, BLASTP can be used to search a protein database. If your sequence is non-coding, use FASTA. You should rarely have to use BLASTN. If you want to do a preliminary check for frameshift errors in your sequence, use BLASTX to compare your sequence, translated in all six reading frames, against a protein database. If you want to search for a particular protein sequence in an EST database, use TBLASTN.
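
To make the redundancy argument concrete, the sketch below translates two made-up DNA sequences that differ at a third of their positions yet encode the same peptide. It also shows what "all six reading frames" means in practice. The sequences and function names are illustrative assumptions; the codon table is the standard genetic code.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA string codon by codon; a trailing partial codon is ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(dna):
    """Translate three forward frames, then three reverse-complement frames."""
    rc = reverse_complement(dna)
    return [translate(strand[offset:]) for strand in (dna, rc) for offset in range(3)]


def identity(a, b):
    """Fraction of identical positions between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


dna_1 = "ATGGCTCTGTCA"
dna_2 = "ATGGCCTTATCG"
print(translate(dna_1), translate(dna_2))              # same peptide from both
print(identity(dna_1, dna_2))                          # DNA identity is lower
print(identity(translate(dna_1), translate(dna_2)))    # protein identity is 1.0
print(six_frames(dna_1))
```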

The BLAST pages have recently been redesigned for greater ease of use. They are divided into six sections, of which the first three are pointers to the usual suite of BLAST programs. The first section deals with nucleotide against nucleotide searches, i.e. BLASTN, and includes the MegaBLAST program, which is useful for large-scale nucleotide-nucleotide analysis, such as searching the whole of dbEST (the EST, or expressed sequence tag, database) against the whole of the nucleotide non-redundant database (nr). It also includes a pointer to a search for short, nearly exact matches. This is the same page as the normal BLASTN server, but with the parameters pre-set to improve the likelihood of finding meaningful matches with short queries, which are likely to appear in the database numerous times just by chance.

The second section deals with protein-protein searches. It includes the BLASTP program, a page with parameters pre-set for searching with short sequences (as for the BLASTN program), and a link to the PSI-BLAST and PHI-BLAST programs, which use iterative searches with profiles to increase sensitivity for more distantly related matches.

The third section includes the remaining BLAST programs, which translate nucleotide sequence into protein sequence for the query sequence, the database sequences, or both.

Each program has the same look-and-feel as the others, although some programs have more options than others. Those options common to all programs include:

  • Search sequence: this gives you a text-entry box into which you can cut and paste your query sequence (or the accession number or GI).
  • Subsequence: this allows you to choose a particular region of your query sequence to search with, instead of the whole sequence.
  • Database: this is a pull down menu with a list of the relevant databases for that program, e.g. protein only databases for BLASTP.
  • Limit by Entrez query / organism: this forces the program to return only matches that conform to your specifications, e.g. only sequences that were modified after January 2001, or only matches that are longer than 200 residues and have the word 'pancreas' in the entry. The pull-down menu allows you to limit your matches to a specific organism or a specific taxonomic class, e.g. Mammalia.
  • Filter: the default is to filter the query sequence for low complexity regions, which may be common in the database, and yield statistically significant but biologically uninteresting results. The BLASTN program also allows you to filter for human repeats (although this is only useful if your query sequence is from human).
  • Expect: this is the expectation value, or how many hits you would expect to find by chance. The lower this value is, the more stringent the search becomes, and the less likely it will be that you will get completely chance matches in your results. When searching for a short, or common, motif, you should increase the expectation value (to a maximum of 1000.0). If you are only interested in very precise homologs, set the expectation value to 0.0001. The default value is 10.
  • Word size: the length of the 'words' into which the query sequence is broken (by default, 3 for proteins and 11 for nucleotides).
  • Other advanced options include:
      • Cost to open a gap [Integer]
      • Cost to extend a gap [Integer]
      • Expectation value (E) [Real]
      • Word size
      • Number of one-line descriptions (V) [Integer]
      • Number of alignments to show (B) [Integer]
  • Formatting options: these detail how the search output should look. Formatting choices can be made at the outset of the search, and can be changed later.
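
As a rough illustration of how the Expect threshold behaves, the sketch below uses the commonly quoted approximation E = m × n × 2^(-S'), where S' is the bit score, m the query length and n the total length of the database. The formula is not taken from the text above, BLAST itself uses corrected (effective) lengths, and the hit names, scores and lengths are invented for the example.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalise a raw score: S' = (lambda * S - ln K) / ln 2.

    lam and k are the statistical parameters of the scoring system;
    they must be supplied by the caller.
    """
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, query_len, db_len):
    """Approximate E-value: E = m * n * 2 ** (-S')."""
    return query_len * db_len * 2.0 ** (-bits)


def keep_hits(hits, query_len, db_len, threshold=10.0):
    """Return (name, E) pairs at or below the threshold, best first."""
    kept = []
    for name, bits in hits:
        e = expect_value(bits, query_len, db_len)
        if e <= threshold:
            kept.append((name, e))
    return sorted(kept, key=lambda item: item[1])


# Hypothetical hits as (name, bit score) for a 250-residue query
# searched against a database of 10**9 residues.
hits = [("hit_A", 60.0), ("hit_B", 38.0), ("hit_C", 30.0)]

print(keep_hits(hits, 250, 10**9))                     # default threshold of 10
print(keep_hits(hits, 250, 10**9, threshold=0.0001))   # stringent threshold
```

With the default threshold two of the three invented hits survive; with the stringent value of 0.0001 only the highest-scoring one does.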

Each program also has its own set of specific options:

  • Matrix (and gap costs) (all BLAST programs except BLASTN): a matrix assigns a score to every possible pair of amino acid residues, so that a match and each kind of mismatch earn different numbers of points. The choice of matrix may influence the order in which sequence hits are reported, and sometimes determines whether a hit is reported at all. The default matrix is BLOSUM62, which is based on substitutions observed in a database of aligned sequence blocks: sequences that are at least 62% identical are clustered together so that close relatives are not over-counted, and the substitutions observed between clusters yield the matrix. Use of this matrix will find most weak protein similarities. Other matrices, such as BLOSUM90 or PAM30, are based on sequences that are evolutionarily close; matrices such as BLOSUM30, BLOSUM40 or PAM250 are based on evolutionarily distant sequences. Use the latter to pick up longer, more distant homologs. The raw score of an alignment is the sum of the scores for aligning pairs of residues and the scores for gaps. Increasing the gap costs (gap opening and gap extension penalties) will result in alignments with fewer inserted gaps.
  • File (MegaBLAST): allows you to upload a file of sequences, because MegaBLAST is designed to run quickly on a large number of sequences.
  • Return alignment end points only (MegaBLAST): this returns only the start and end points of the HSPs found, and not the alignment of the HSP itself. This makes the search run slightly faster, and cuts down on the amount of output returned.
  • Percent ID (MegaBLAST): allows you to choose the percent identity threshold that HSPs must pass to be included in the output. Generally, this will be set to a high value, so that only exact, or nearly exact matches are returned.
  • Genetic codes (translated BLAST programs): Specifies which genetic code to use when translating a nucleotide sequence to amino acid. By default, the standard code is used.
  • CD Search (BLASTP, PSI-/PHI-BLAST): compares protein sequences to the Conserved Domain Database, which is a database of functional and structural domains. This can help to identify domains within a protein sequence.
  • Composition based statistics (BLASTP, PSI-/PHI-BLAST): when calculating E-values, this takes into account the amino acid composition of the individual database sequences involved in reported alignments. This improves E-value accuracy, thereby reducing the number of false positive results.
  • PSSM (BLASTP, PSI-/PHI-BLAST): this allows the use of a previously constructed Position Specific Score Matrix (saved from a PSI-BLAST search). This gives you a profile of the residues at each position, and can help in finding more distantly related homologs.
  • PHI Pattern (BLASTP, PSI-/PHI-BLAST): allows the use of a pattern (generated by PHI-BLAST) to help identify distantly related homologs.
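
The raw score described under the Matrix option (pair scores summed, gap costs subtracted) can be worked through with a small example. The sketch assumes that a gap of length k costs the opening cost plus k times the extension cost, and it uses a toy scoring function (+4 for identical residues, -1 otherwise) rather than a real substitution matrix such as BLOSUM62. All names and values are illustrative.

```python
def toy_pair_score(x, y):
    """Hypothetical stand-in for a substitution matrix lookup."""
    return 4 if x == y else -1


def raw_score(seq_a, seq_b, pair_score=toy_pair_score, gap_open=11, gap_extend=1):
    """Score a gapped alignment given as two equal-length strings with '-' for gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    gap_side = None  # which sequence the current gap run is in, if any
    for x, y in zip(seq_a, seq_b):
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            if side != gap_side:
                score -= gap_open      # charged once per gap run
                gap_side = side
            score -= gap_extend        # charged for every gapped position
        else:
            score += pair_score(x, y)
            gap_side = None
    return score


print(raw_score("MKT-LLV", "MKTAALV"))                            # one gap of length 1
print(raw_score("AC--GT", "ACTTGT"))                              # one gap of length 2
print(raw_score("MKT-LLV", "MKTAALV", gap_open=5, gap_extend=2))  # cheaper gaps
```

Raising `gap_open` or `gap_extend` lowers the score of any alignment that contains gaps, which is why higher gap costs lead to reported alignments with fewer inserted gaps.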
