From Icbwiki

Revision as of 20:42, 13 October 2006

Searching biological sequences with information retrieval approaches

This page describes a version of Twease that can search a corpus made of biological sequences.

I made a corpus from the Ensembl human proteins (Homo_sapiens.NCBI36.41.pep.all.fa) and indexed it with Textractor/MG4J, using overlapping 3-grams as words.
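As an illustration of what "overlapping 3-grams as words" means, here is a minimal Python sketch. It is not the Textractor/MG4J tokenizer; the function name `overlapping_ngrams` is hypothetical, and the assumption is simply that each window of three residues, advanced one residue at a time, becomes one word.

```python
def overlapping_ngrams(sequence, n=3):
    """Return every length-n window of the sequence, advancing one residue at a time."""
    sequence = sequence.strip().upper()
    # A sequence shorter than n yields no windows at all.
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


# First eight residues of the example query sequence used on this page.
print(overlapping_ngrams("MSHYGSYY"))
```

A protein of length L therefore contributes L - 2 words to the index under this scheme.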

I searched this corpus with BM25 scoring and disjunctive queries (i.e., AAP | HIL| ...). The example query was chosen at random among the sequences in the indexed corpus:

 >ENSP00000331971 pep:known chromosome:NCBI36:21:30774235:30774507:-1 gene:ENSG00000184351 transcript:ENST00000331764
 MSHYGSYYGGLGYSCGGFGGLGYGYGCGCGSFCRRGYGYGSGFGSYGYGSGFGGYGYGSG
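The query construction and ranking described above can be sketched in Python as follows. This is a toy under stated assumptions, not the Twease/MG4J implementation: the helper names are hypothetical, the BM25 variant (with the "+1" inside the logarithm) and the parameters k1 = 1.2 and b = 0.75 are common defaults rather than values taken from this page, and the second and third corpus sequences are invented for illustration.

```python
import math
from collections import Counter


def ngrams(seq, n=3):
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def disjunctive_query(seq, n=3):
    """Join the distinct n-grams of a sequence with ' | ' (OR), in order of first appearance."""
    return " | ".join(dict.fromkeys(ngrams(seq, n)))


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query terms with an Okapi BM25 variant."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query_terms):
            if tf[term] == 0:
                continue  # disjunctive query: a missing term contributes nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus: the first entry is the start of the query; the others are invented.
corpus = ["MSHYGSYYGG", "GSYYGGLGYS", "AAPHILKKRW"]
docs = [ngrams(s) for s in corpus]
query = "MSHYGSYYGG"
print(disjunctive_query(query))
print(bm25_scores(ngrams(query), docs))
```

Because the query is disjunctive, a document needs to share only some 3-grams with the query to receive a non-zero score; documents sharing more (and rarer) 3-grams rank higher.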

At this stage, I could not build a doc store, so the index of each sequence in the input FASTA file is used as the document identifier (/PMID) in the trec_eval output.

ENSP00000331971 has index 132-1=131 in the input FASTA file (sequence numbers start at 0 with Textractor, but grep numbers lines from 1).

 Fabien Campagne@PC120871 /cygdrive/d/dev/ensembl-data
 $ grep '>' Homo_sapiens.NCBI36.41.pep.all.fa|grep -n ENSP00000331971
 132:>ENSP00000331971 pep:known chromosome:NCBI36:21:30774235:30774507:-1 gene:ENSG00000184351 transcript:ENST00000331764
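The off-by-one between grep's 1-based line numbers and the 0-based sequence numbers can be avoided by counting headers directly. The sketch below is a hypothetical helper (`sequence_index` is not part of Textractor); it assumes the sequence identifier is the first whitespace-delimited token after `>` on each header line.

```python
def sequence_index(fasta_lines, sequence_id):
    """Return the 0-based position of sequence_id among the FASTA headers, or None."""
    index = -1
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines, count only headers
        index += 1
        fields = line[1:].split()
        if fields and fields[0] == sequence_id:
            return index
    return None


# Usage against the file from this page (path assumed to be the current directory):
# with open("Homo_sapiens.NCBI36.41.pep.all.fa") as handle:
#     print(sequence_index(handle, "ENSP00000331971"))
```

If the counting convention matches the one described above, the result for ENSP00000331971 should agree with 132 - 1 = 131.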

The best matches in trec_eval format are:

 100	Q0	131	0	42.46582283201506	twease
 100	Q0	148	1	28.144966050948458	twease
 100	Q0	133	2	25.34762250828596	twease
 100	Q0	135	3	23.00047774168498	twease
 100	Q0	141	4	22.69864217728251	twease
 100	Q0	134	5	22.67280744193289	twease
 100	Q0	139	6	22.358672780465746	twease
 100	Q0	144	7	22.312434886092404	twease
 100	Q0	132	8	21.778693703352136	twease
 100	Q0	136	9	21.145654858966886	twease
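For readers unfamiliar with the layout, the lines above follow the usual TREC run format: query id, the literal `Q0`, document id, rank, score, and run tag. A minimal Python sketch for reading them is shown below; the `RunLine` type and `parse_run_line` function are hypothetical names, and document ids are parsed as integers only because, on this page, they are sequence indices.

```python
from typing import NamedTuple


class RunLine(NamedTuple):
    query_id: str
    doc_id: int     # here: 0-based index of the sequence in the FASTA file
    rank: int       # 0 is the best match
    score: float
    run_tag: str


def parse_run_line(line):
    """Split one whitespace-separated run line into its six fields."""
    query_id, _q0, doc_id, rank, score, run_tag = line.split()
    return RunLine(query_id, int(doc_id), int(rank), float(score), run_tag)


lines = [
    "100\tQ0\t131\t0\t42.46582283201506\ttwease",
    "100\tQ0\t148\t1\t28.144966050948458\ttwease",
]
hits = [parse_run_line(line) for line in lines]
best = min(hits, key=lambda hit: hit.rank)
print(best.doc_id, best.score)
```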

Therefore, the query sequence is found at rank 0 (best). That's a good start. The next best match is the sequence at index 148 in the input file.

 $ grep '>' Homo_sapiens.NCBI36.41.pep.all.fa |head -149 |tail -1
 >ENSP00000335566 pep:known-ccds chromosome:NCBI36:21:31049328:31049567:-1 gene:ENSG00000187005 transcript:ENST00000335093 CCDS13606.1
 $ grep '>' Homo_sapiens.NCBI36.41.pep.all.fa|grep -n ENSP00000335566
 149:>ENSP00000335566 ...
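The reverse lookup (from a 0-based sequence index back to its header) done above with `head` and `tail` can also be sketched in Python. `header_at` is a hypothetical helper; it assumes, as above, that indices count header lines starting from 0.

```python
from itertools import islice


def header_at(fasta_lines, index):
    """Return the header line of the sequence at the given 0-based index, or None."""
    headers = (line.rstrip("\n") for line in fasta_lines if line.startswith(">"))
    return next(islice(headers, index, None), None)


# Usage against the file from this page (path assumed to be the current directory):
# with open("Homo_sapiens.NCBI36.41.pep.all.fa") as handle:
#     print(header_at(handle, 148))
```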

ENSP00000335566 is annotated as a novel gene, but it is marked as belonging to the keratin family (ENSF00000003793 : KERATIN).

The query sequence is Keratin-associated protein 21-1, which is also marked as a member of the keratin family (ENSF00000003793). The second-best hit therefore belongs to the same family as the query.
