Twease/FastaTwease
From Icbwiki
Searching biological sequences with information retrieval approaches
FastaTwease is a version of Twease that can search a corpus of biological sequences.
I built a corpus from the Ensembl human proteins (Homo_sapiens.NCBI36.41.pep.all.fa) and indexed it with Textractor/MG4J, using overlapping 3-grams as words.
I searched this corpus with BM25 scoring and disjunctive queries (e.g., AAP | HIL | ...), using an example query sequence chosen at random from the indexed corpus:
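As an illustration of the tokenization described above, here is a minimal Python sketch that splits a protein sequence into overlapping 3-grams. The function name `overlapping_ngrams` is hypothetical; the actual indexing was done by Textractor/MG4J, and this sketch does not reproduce its tokenizer.

```python
def overlapping_ngrams(sequence, n=3):
    """Return every overlapping n-gram of `sequence`, in order.

    Hypothetical helper for illustration only; not the Textractor/MG4J tokenizer.
    """
    sequence = sequence.strip().upper()
    # A window of width n slides one residue at a time.
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


words = overlapping_ngrams("MSHYGSYY")
print(words)
# ['MSH', 'SHY', 'HYG', 'YGS', 'GSY', 'SYY']
```

A sequence of length L yields L - 2 such words, and a sequence shorter than three residues yields none.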
>ENSP00000331971 pep:known chromosome:NCBI36:21:30774235:30774507:-1 gene:ENSG00000184351 transcript:ENST00000331764
MSHYGSYYGGLGYSCGGFGGLGYGYGCGCGSFCRRGYGYGSGFGSYGYGSGFGGYGYGSG
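To make the search step concrete, the following Python sketch builds a disjunctive 3-gram query from a sequence and ranks a toy corpus with BM25. Everything here is an assumption made for illustration: the function names, the three-sequence corpus, the parameters k1 = 1.2 and b = 0.75, the removal of duplicate query terms, and the idf form log(1 + (N - df + 0.5) / (df + 0.5)), which is one common BM25 variant. It is not the MG4J scorer used for the actual experiment.

```python
import math
from collections import Counter


def trigrams(seq, n=3):
    """Overlapping n-grams of a sequence (hypothetical helper)."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def disjunctive_query(seq):
    """Join the distinct 3-grams of `seq` with ' | ', keeping first-seen order."""
    return " | ".join(dict.fromkeys(trigrams(seq)))


def bm25_rank(query_seq, corpus, k1=1.2, b=0.75):
    """Rank corpus sequences against a disjunctive 3-gram query.

    Returns (1-based position in corpus, score) pairs, best first.
    A sequence matches if it shares at least one 3-gram with the query.
    """
    if not corpus:
        return []
    docs = [Counter(trigrams(s)) for s in corpus]
    lengths = [sum(d.values()) for d in docs]
    avg_len = sum(lengths) / len(docs)
    n_docs = len(docs)
    query_terms = set(trigrams(query_seq))
    # Document frequency of each query term.
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}

    ranked = []
    for position, (doc, length) in enumerate(zip(docs, lengths), start=1):
        score = 0.0
        for term in query_terms:
            tf = doc.get(term, 0)
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf + k1 * (1 - b + b * length / avg_len)
            score += idf * tf * (k1 + 1) / norm
        if score > 0:
            ranked.append((position, score))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Toy corpus invented for this example; not taken from the Ensembl file.
toy_corpus = ["MSHYGSYYGG", "AAPHILKK", "GSYYGGLGYS"]
print(disjunctive_query("MSHYGSYYGG"))
ranking = bm25_rank("MSHYGSYYGG", toy_corpus)
for position, score in ranking:
    print(position, round(score, 3))
```

In this toy run the query sequence itself ranks first, the sequence sharing four 3-grams ranks second, and the sequence with no shared 3-gram is not returned at all.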
I could not build a document store at this stage, so the position of each sequence in the input FASTA file is used as the document identifier (the PMID field) in the trec_eval output.
ENSP00000331971 has index 132 in the input FASTA file:
Fabien Campagne@PC120871 /cygdrive/d/dev/ensembl-data
$ grep '>' Homo_sapiens.NCBI36.41.pep.all.fa | grep -n ENSP00000331971
132:>ENSP00000331971 pep:known chromosome:NCBI36:21:30774235:30774507:-1 gene:ENSG00000184351 transcript:ENST00000331764
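The same lookup can be sketched in Python for cases where the position is needed inside a script. The function name `fasta_record_index` and the example records are hypothetical. Unlike the grep pipeline, which matches the accession anywhere in the header, this sketch compares only the first whitespace-delimited token after '>'.

```python
def fasta_record_index(lines, accession):
    """Return the 1-based position of the record whose header ID equals
    `accession`, counting header lines only, or None if it is absent."""
    position = 0
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a record header
        position += 1
        fields = line[1:].split()
        if fields and fields[0] == accession:
            return position
    return None


# Invented records for illustration; a real call would pass an open file handle.
example = [
    ">SEQ_A description one",
    "MSHYGSYY",
    ">SEQ_B description two",
    "AAPHIL",
]
print(fasta_record_index(example, "SEQ_B"))  # 2
```

Passing an open handle on Homo_sapiens.NCBI36.41.pep.all.fa with the accession ENSP00000331971 should give the same position as the grep output, assuming each record has exactly one header line.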