Twease/FastaTwease
From Icbwiki
Revision as of 20:42, 13 October 2006 by Fabien Campagne
Searching biological sequences with information retrieval approaches
A version of Twease that can search a corpus made of biological sequences.
I built a corpus from the Ensembl human proteins (Homo_sapiens.NCBI36.41.pep.all.fa) and indexed it with Textractor/MG4J, using overlapping 3-grams as words.
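To make the tokenization concrete, here is a minimal Python sketch of splitting a protein sequence into overlapping 3-grams. It only illustrates the idea; the function name `overlapping_kgrams` is hypothetical and this is not the Textractor/MG4J code.

```python
def overlapping_kgrams(sequence, k=3):
    """Return every length-k window of the sequence, shifting by one residue."""
    sequence = sequence.strip().upper()
    # A sequence shorter than k yields no windows at all.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# First eight residues of the example query shown further down this page.
print(overlapping_kgrams("MSHYGSYY"))
# ['MSH', 'SHY', 'HYG', 'YGS', 'GSY', 'SYY']
```

A sequence of length n therefore produces n - 2 words, and each residue (except those at the ends) appears in three of them.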
I searched this corpus with BM25 scoring and disjunctive queries (of the form AAP | HIL | ...), using an example query chosen at random among the sequences in the indexed corpus:
>ENSP00000331971 pep:known chromosome:NCBI36:21:30774235:30774507:-1 gene:ENSG00000184351 transcript:ENST00000331764
MSHYGSYYGGLGYSCGGFGGLGYGYGCGCGSFCRRGYGYGSGFGSYGYGSGFGGYGYGSG
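The following Python sketch shows how a sequence can be turned into a disjunctive 3-gram query and how documents can be ranked with BM25. It rests on several assumptions: it uses one common textbook variant of Okapi BM25 with k1 = 1.2 and b = 0.75, which may differ from the MG4J scorer and its settings, and the three-sequence corpus is made-up toy data, not Ensembl sequences. All function names are hypothetical.

```python
import math
from collections import Counter


def kgrams(seq, k=3):
    """Overlapping k-grams of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def disjunctive_query(seq, k=3):
    """Join the distinct k-grams with '|' (OR), in order of first appearance."""
    return " | ".join(dict.fromkeys(kgrams(seq, k)))


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document (a list of tokens) against the OR of query_terms.

    k1 and b are assumed defaults, not values taken from Twease or MG4J.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(t for d in docs for t in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in set(query_terms):
            if tf[term] == 0:
                continue  # disjunction: an absent term simply adds nothing
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus of made-up peptide fragments (hypothetical data).
corpus = ["MSHYGSYYGG", "GYGYGSGFGS", "AAPHILKRTW"]
docs = [kgrams(s) for s in corpus]
scores = bm25_scores(kgrams("MSHYGSYYGG"), docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)

print(disjunctive_query("MSHYGS"))  # MSH | SHY | HYG | YGS
print(ranking)                      # [0, 1, 2]
```

In this toy run the query is itself document 0, so it ranks first; document 1 shares only the 3-gram YGS with the query, and document 2 shares none and scores 0.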
At this stage, I could not build a doc store, so the index of the sequence in the input FASTA file is used as the document identifier (in place of the PMID) in the trec_eval output.
ENSP00000331971 has index 132-1=131 in the input FASTA file (sequence numbers start at 0 with Textractor, but grep -n numbers lines from 1).
Fabien Campagne@PC120871 /cygdrive/d/dev/ensembl-data
$ grep '>' Homo_sapiens.NCBI36.41.pep.all.fa|grep -n ENSP00000331971
132:>ENSP00000331971 pep:known chromosome:NCBI36:21:30774235:30774507:-1 gene:ENSG00000184351 transcript:ENST00000331764
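The off-by-one between grep and the indexer can be avoided by counting headers from 0 directly. The Python sketch below does that; the function name `zero_based_index` and the three-record FASTA text with identifiers P1 to P3 are hypothetical and used only for illustration.

```python
import io


def zero_based_index(fasta_lines, sequence_id):
    """Return the 0-based position of sequence_id among '>' headers, or None."""
    index = -1
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines, only headers are counted
        index += 1
        # The identifier is the first whitespace-delimited word after '>'.
        fields = line[1:].split()
        if fields and fields[0] == sequence_id:
            return index
    return None


# Hypothetical toy file with three records.
toy_fasta = ">P1 pep:known\nMSHY\n>P2 pep:known\nGYGS\nGFGG\n>P3 pep:novel\nAAPH\n"
print(zero_based_index(io.StringIO(toy_fasta), "P2"))  # 1
```

For the same record, `grep '>' file | grep -n P2` would report line 2, which is this index plus one.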
The best matches in trec_eval format are:
100 Q0 131 0 42.46582283201506 twease
100 Q0 148 1 28.144966050948458 twease
100 Q0 133 2 25.34762250828596 twease
100 Q0 135 3 23.00047774168498 twease
100 Q0 141 4 22.69864217728251 twease
100 Q0 134 5 22.67280744193289 twease
100 Q0 139 6 22.358672780465746 twease
100 Q0 144 7 22.312434886092404 twease
100 Q0 132 8 21.778693703352136 twease
100 Q0 136 9 21.145654858966886 twease
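Each result line has six whitespace-separated columns in the usual TREC run layout: query id, the literal Q0, document id, rank, score, and run tag. Here the document id is the 0-based sequence index. A minimal Python sketch for reading such lines follows; the names `RunLine` and `parse_run_line` are hypothetical.

```python
from typing import NamedTuple


class RunLine(NamedTuple):
    query_id: str
    doc_id: str
    rank: int
    score: float
    run_tag: str


def parse_run_line(line):
    """Split one run line into its fields; the second column (Q0) is ignored."""
    query_id, _iteration, doc_id, rank, score, run_tag = line.split()
    return RunLine(query_id, doc_id, int(rank), float(score), run_tag)


# The first three result lines from this page.
lines = """100 Q0 131 0 42.46582283201506 twease
100 Q0 148 1 28.144966050948458 twease
100 Q0 133 2 25.34762250828596 twease""".splitlines()

hits = [parse_run_line(line) for line in lines]
best = min(hits, key=lambda h: h.rank)  # rank 0 is the best match
print(best.doc_id)  # 131
```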
Therefore, the query sequence is found at rank 0 (best). That's a good start. The next best match is the sequence at index 148 in the input file.
$ grep '>' Homo_sapiens.NCBI36.41.pep.all.fa |head -149 |tail -1
>ENSP00000335566 pep:known-ccds chromosome:NCBI36:21:31049328:31049567:-1 gene:ENSG00000187005 transcript:ENST00000335093 CCDS13606.1
Verification:
$ grep '>' Homo_sapiens.NCBI36.41.pep.all.fa|grep -n ENSP00000335566
149:>ENSP00000335566 ...
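The `head -149 | tail -1` step picks the 149th header because index 148 is 0-based. The same lookup can be sketched in Python without the manual +1; the function name `header_at` and the toy FASTA text are hypothetical.

```python
import io
from itertools import islice


def header_at(fasta_lines, index):
    """Return the header of the sequence at a 0-based index, or None if out of range."""
    headers = (line.rstrip("\n") for line in fasta_lines if line.startswith(">"))
    # islice skips the first `index` headers; next() takes the one after them.
    return next(islice(headers, index, None), None)


# Hypothetical toy file with three records.
toy_fasta = ">P1 pep:known\nMSHY\n>P2 pep:known\nGYGS\n>P3 pep:novel\nAAPH\n"
print(header_at(io.StringIO(toy_fasta), 2))  # >P3 pep:novel
```

Applied to an open handle on the real FASTA file, `header_at(handle, 148)` would correspond to the shell pipeline above.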
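ENSP00000335566 is a novel gene, but marked as belonging to the [http://www.ensembl.org/Homo_sapiens/familyview?family=ENSF00000003793 keratin family] (ENSF00000003793 : KERATIN).
The query sequence is [http://www.ensembl.org/Homo_sapiens/protview?peptide=ENSP00000335566 Keratin-associated protein 21-1] marked as a member of the keratin family (ENSF00000003793).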