One of the most widely used bioinformatics techniques is
searching a database for sequences similar to a query sequence.
If two sequences are similar, then it is likely that they
have similar structures, and therefore perhaps similar
functions. Thus, if you have an as-yet-uncharacterized sequence,
finding homologs in the databases can give you an idea of what
its identity might be.
(Find protein sequences at NCBI.)
There are several different algorithms for implementing a
homology search, and each program has a wide range of options
and parameters to help you carry out a more informative
search. The algorithm that gives the most exact and informative
matches is the Smith-Waterman algorithm, which was also one of the
earliest alignment algorithms developed. However, it
cannot practically be used with large databases because the algorithm is so
computationally intensive that it becomes infeasibly slow. The Smith-Waterman
algorithm is most useful with smaller protein databases, such as SwissProt.
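To make the cost concrete, here is a minimal sketch of Smith-Waterman local alignment scoring. It is an illustration, not the implementation used by any particular server: the function name `smith_waterman_score` is hypothetical, and the scoring values (match +2, mismatch -1, linear gap penalty -2) are assumptions chosen for readability. Real protein searches typically use a substitution matrix and more elaborate gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Scoring values are illustrative assumptions, not defaults of any
    real search server.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of a local alignment ending at
    # a[i-1] and b[j-1]; row 0 and column 0 stay at zero.
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            diag = H[i - 1][j - 1] + pair   # align a[i-1] with b[j-1]
            up = H[i - 1][j] + gap          # gap in b
            left = H[i][j - 1] + gap        # gap in a
            # Clamping at zero lets an alignment restart anywhere,
            # which is what makes the alignment local.
            H[i][j] = max(0, diag, up, left)
            best = max(best, H[i][j])
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Every cell of the table must be filled in, so comparing a query of length m against a database sequence of length n takes on the order of m × n steps, repeated for every sequence in the database. That exhaustive work is why the results are exact and also why the search becomes slow on large databases.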
The most commonly used programs are those of the BLAST family, which
give biologically meaningful matches in a reasonable amount of
time. However, the FASTA program, although usually a little
slower, is almost always more sensitive than BLAST when using a
DNA sequence to query a DNA database.
The BLAST, FASTA and Smith-Waterman (MPSrch) servers are used to find homologs of
the query sequence in the databases, which aids in identifying the query sequence.
The InterProScan server, on the other hand, searches the InterPro database to
identify regions of a protein that may carry out some function. The InterPro database is
an integrated resource that stores information about protein families, domains,
repeat regions and other functional sites from the most commonly used signature databases.