AtaxiaGenes

From Icbwiki

To align Solexa sequences in FASTA format against adapter sequences:

./ssearch34 -E 10000 -m 9 s1.fasta adapter.fasta -Q -d 1 -b 1 |./fastam9_table.pl


OPTIONS

fasta and the other programs can be directed to change the scoring matrix, search parameters, output format, and default search directories by entering options on the command line (preceded by a `-' or `/' for MS-DOS). All of the options should precede the file name and ktup arguments. Alternatively, these options can be changed by setting environment variables. The options and environment variables are:

-1

   Normally, the top scoring sequences are ranked by their initn score. By using the -1 option, sequences are ranked by their init1 score.

-a

   (SHOWALL) Modifies the display of the two sequences in alignments. Normally, both sequences are shown only where they overlap (SHOWALL=0); if -a is given or the environment variable SHOWALL=1 is set, both sequences are shown in their entirety.

-b #

   The number of similarity scores to be shown when the -Q option is used. This value is usually calculated based on the actual scores.

-c #

   (OPTCUT) The threshold for optimization with the -o option. The OPTCUT value is normally calculated based on sequence length.

-d #

   The number of alignments to be shown. Normally, fasta shows the same number of alignments as similarity scores. By using fasta -Q -b 200 -d 50, one would see the top scoring 200 sequences and alignments for the 50 best scores.

-f #

   Penalty for the first residue in a gap (-12 by default).

-g #

   Penalty for additional residues in a gap (-2 by default).

-h

   Do not display histogram of similarity scores.

-k #

   (GAPCUT) Sets the threshold for joining the initial regions for calculating the initn score.

-l #

   (FASTLIBS) The name of the library menu file. Normally this is determined by the environment variable FASTLIBS. However, a library menu file can also be specified with -l.

-m #

   (MARKX) =0,1,2,3,4. Alternate display of matches and mismatches in alignments. MARKX=0 uses ":", ".", and " " for identities, conservative replacements, and non-conservative replacements, respectively. MARKX=1 uses " ", "x", and "X". MARKX=2 does not show the second sequence, but uses the second alignment line to display matches with a "." for identity, or with the mismatched residue for mismatches. MARKX=2 is useful for aligning large numbers of similar sequences. MARKX=3 writes out a file of library sequences in FASTA format. MARKX=3 should always be used with the "SHOWALL" (-a) option, but this does not completely ensure that all of the sequences output will be aligned. MARKX=4 displays a graph of the alignment of the library sequence with respect to the query sequence, so that one can identify the regions of the query sequence that are conserved.

-n

   Forces the query sequence to be treated as a DNA sequence.

-o

   Causes fasta to perform a limited optimization on all of the sequences in the library with initn scores greater than OPTCUT. This slows the program down about 5-fold, but, when combined with ktup=1, provides an extremely sensitive sequence comparison.

-Q

   Quiet option. This allows fasta and tfasta to search a database and report the results without asking any questions. A command such as fasta -Q file library > output can be put in the background or run at a later time with the Unix 'at' command. The number of similarity scores and alignments displayed with the -Q option can be modified with the -b (scores) and -d (alignments) options.

-r

   (STATFILE) Causes fasta to write out the sequence identifier, superfamily number (if available), and similarity scores to STATFILE for every sequence in the library. These results are not sorted.

-s str

   (SMATRIX) The filename of an alternative scoring matrix file. For protein sequences, BLOSUM50 is used by default; PAM250 can be used with the command line option -s 250.

-v str

   (LINEVAL) (plfasta only) plfasta and pclfasta can use up to 4 different line styles to denote the scores of local alignments. The scores that correspond to these line styles can be specified with the environment variable LINEVAL, or with the -v option. In either case, a string with three numbers separated by spaces should be given. This string must be surrounded by double quotation marks. For example, LINEVAL="200 100 50" tells plfasta to use solid lines for local alignments with scores greater than 200, long dashed lines for scores between 100 and 200, short dashed lines for scores between 50 and 100, and dotted lines for scores less than 50. The equivalent command line specification is plfasta -v "200 100 50". Normally, the values are 200, 100, and 50 for protein sequence comparisons and 400, 200, and 100 for DNA sequence comparisons.

-w #

   (LINLEN) Output line length for sequence alignments (normally 60, can be set up to 200).

-x "offset1 offset2"

   Causes fasta/lfasta/plfasta to start numbering the aligned sequences starting with offset1 and offset2, rather than 1 and 1. This is particularly useful for showing alignments of promoter regions.

-y #

   Set the bandwidth used for optimization. -y 16 is the default for protein when ktup=2 and for all DNA alignments. -y 32 is used for protein and ktup=1. For proteins, optimization slows comparison 2-fold and is highly recommended.

-z

   Do not do statistical significance calculation.

-3

   tfasta only. Normally tfasta translates sequences in the DNA sequence library in all six frames. With the -3 option, only the three forward frames are searched.



Ataxia-causing genes are found in a number of genetic loci identified in the human genome.

We assemble a library of proteins found at loci for ataxia-causing genes. Each locus is defined as a 10 Mb window ± 2 Mb around the ataxia-causing gene. The variation is meant to convey the relative error of the mapping data.

Recode each protein/transcript to encode the source locus.

See textractor.tools.SequenceLibraryTagger
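
As a rough illustration of what the recoding step produces, the sketch below appends a locus tag to each FASTA identifier, giving IDs of the form ENST00000283309:L6_2 as seen in the result table on this page. It is not the SequenceLibraryTagger implementation: the function name tag_fasta, the file names, and the example sequence are all made up for the illustration.

```shell
# Hypothetical helper (not part of textractor): append ":<locus>" to the
# identifier on every FASTA header line.
tag_fasta() {
  # $1 = locus tag (e.g. L6_2), $2 = input FASTA file
  awk -v tag="$1" '
    /^>/ { $1 = $1 ":" tag }   # first field is ">ID"; runs of spaces in the header collapse to one
    { print }                  # sequence lines pass through unchanged
  ' "$2"
}

# Tiny made-up input: the description and the sequence are placeholders.
printf '>ENST00000283309 example description\nMKTAYIAK\n' > example.fasta

tag_fasta L6_2 example.fasta > example-tagged.fasta
cat example-tagged.fasta
```

Tagging the identifier itself, rather than the description, means the locus survives into the first two columns of the tabular search output.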

Produce protein similarity distances:

./ssearch34 -E 100 -s idn_10_3.mat -m 9 -d 0 ataxia-v2.fasta ataxia-v2.fasta -g -20 | ./fastam9_table.pl > ataxia-all-vs-all-v3.tab
cut -f 1,2,11,12 ataxia-all-vs-all-v3.tab > reduced-v3.tab

See BioPerl for the fastam9_to_table.PLS script: http://cvs.bioperl.org/cgi-bin/viewcvs/viewcvs.cgi/bioperl-live/scripts/searchio/fastam9_to_table.PLS?rev=HEAD&cvsroot=bioperl&content-type=text/vnd.viewcvs-markup


reduced.tab should look like this:

 ENST00000283309:L6_2    ENST00000283309:L6_2    2.2e-207
 ENST00000283309:L6_2    ENST00000336070:L6_2    8.7e-145
 ENST00000283309:L6_2    ENST00000358587:L6_2    1.9e-114
 ENST00000283309:L6_2    ENST00000366797:L6_2    1.3e-83
 ENST00000283309:L6_2    ENST00000366798:L6_2    1.3e-83
 ENST00000283309:L6_2    ENSP00000350749:L4_1    1.2e-18
 ENST00000283309:L6_2    ENSP00000349909:L4_1    1.2e-18
 ENST00000283309:L6_2    ENSP00000322675:L4_1    2.8e-18
 ENST00000283309:L6_2    ENST00000343115:L11_1   1.3e-09
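
A quick sanity check of the table layout can be sketched with awk. This assumes the three-column, tab-separated layout shown in the sample (query ID, hit ID, E-value) and IDs of the form ID:L<n>_<n>; the file name sample.tab is a placeholder, and the rows are copied from the sample above.

```shell
# Rows copied from the sample table, written tab-separated to a placeholder file.
printf '%s\t%s\t%s\n' \
  ENST00000283309:L6_2 ENST00000283309:L6_2 2.2e-207 \
  ENST00000283309:L6_2 ENST00000336070:L6_2 8.7e-145 \
  ENST00000283309:L6_2 ENSP00000350749:L4_1 1.2e-18 \
  ENST00000283309:L6_2 ENST00000343115:L11_1 1.3e-09 > sample.tab

# Count rows that do NOT have exactly three fields of the assumed shape:
# two locus-tagged IDs followed by a number in plain or exponent notation.
bad=$(awk -F '\t' '
  NF != 3 ||
  $1 !~ /^[A-Z0-9]+:L[0-9]+_[0-9]+$/ ||
  $2 !~ /^[A-Z0-9]+:L[0-9]+_[0-9]+$/ ||
  $3 !~ /^[0-9.]+(e-?[0-9]+)?$/ { n++ }
  END { print n + 0 }
' sample.tab)

echo "malformed rows: $bad"
```

A non-zero count would suggest that the cut column list or the locus tagging step needs another look before the table is used further.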

Other scripts useful for this project

To output scores instead of E-values:

 cut -f 1,2,12 ataxia-all-vs-all-v3.tab > reduced-v5.tab

To remove loci that were incorrectly built with protein IDs:

 grep -v L6_2 reduced-source.tab | grep -v L3_1 | grep -v L4_1 | grep -v L5_1 > reduced-filtered.tab
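
The same filter can also be written as a single grep call with several -e patterns, which may be easier to extend when more loci need to be dropped. The sketch below runs on a small placeholder file (sample-source.tab); the first two rows come from the sample table on this page, and the third row's E-value is made up for the illustration.

```shell
# Placeholder input: two rows that mention L6_2 and one that does not.
printf '%s\t%s\t%s\n' \
  ENST00000283309:L6_2 ENST00000336070:L6_2 8.7e-145 \
  ENST00000283309:L6_2 ENST00000343115:L11_1 1.3e-09 \
  ENST00000343115:L11_1 ENST00000343115:L11_1 1e-50 > sample-source.tab

# -v inverts the match; each -e adds one pattern, so a line is dropped
# if it contains any of the listed locus tags in either column.
grep -v -e L6_2 -e L3_1 -e L4_1 -e L5_1 sample-source.tab > sample-filtered.tab

cat sample-filtered.tab
```

As with the chained version, the patterns match as plain substrings anywhere on the line, so a row is removed when either the query or the hit carries one of the tags.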