Elementolab/ExpressionSignatures

From Icbwiki

Jump to: navigation, search
  • Merge name of your cell lines to every array name
Each cell lines can have one or two arrays
Each arrays contain a .pair and .tif files
  • Create a folder ‘Inputs arrays’ for all your .pair files
  • Perform an analysis of your ‘input arrays’ dataset:
Using Nimbelscann v2.5 software for Nimbelgen arrays:
Select Analysis -> Expression RMA and open the Expression RMA (Robust Multi-array Analysis) dialog box 
Click ‘Add files’ and import your pair files from the input folder
Choose the algorithm settings (log2 transformation and normalization). Nimbelscan normalize according to the quartils method.
Choose a destination folder
Click on Run RMA

Results files include the normalized gene expression values (*norm_RMA.calls) files and the probe level (*_RMA.pair) files
  • Create a new folder ‘output normalized arrays’ incuding all the *norm_RMA.calls’ files
  • Create a matrix in the format .tsv with rows (probesets) and columns (cell lines), merging all these *norm_RMA.calls’ files for every cell lines

exemple for RL1 cell line (2 arrays) (command lines):

# setwd("")	
Read the two arrays:
RL1_1 <- read.csv("RL1/folder/1_norm_RMA.calls", header=T, sep="\t")
RL1_2 <- read.csv("RL1/folder/2_norm_RMA.calls", header=T, sep="\t")

Merge the expression datas into one vector NEW_:
RL1_1_data <- RL1_1[,4]
RL1_2_data <- RL1_2[,4]
RL1_12_data <- (RL1_1_data+RL1_2_data)/2
Concatenating expression data in a new variable:
NEW_RL1 <- cbind(RL1_12_data)

Create a new variable ‘probeset’ into one or the other array:
RL1_2_probesets <- as.factor(RL1_2[,2])

Name row and columns:
row.names(NEW_RL1) <- RL1_2_probesets
colnames(NEW_RL1) <- c("RL1")
write.table(NEW_RL1, file="RL1/NEW_RL1.txt", sep="\t", quote=FALSE, row.names=T, col.names=NA)
Create a new variable with all the cell lines and write it in a .tsv format:
NEW_ALL_cell lines<-cbind (NEW_RL1, NEW_...)
dim(NEW_ALL)
write.table(NEW_ALL_cell lines, file="NEW_ALL_ cell lines _matrix.tsv", sep="\t", quote=FALSE, row.names=T, col.names=NA)
  • Open the matrix .tsv with Textedit or any other, and full the upper left case of the matrix with the word PROBESET
  • Test your matrix: check around 15 expression levels (for different cell lines and different probesets), comparing those to the means of two arrays calculated from the original .pair files
  • Check that all medians are approximately equal
m <- read.table("NEW_ALL_matrix.tsv", header=T, row.names=1)
barplot(apply(m, 2, median))
  • Unsupervised hierarchical clustering using Cluster 3.0 (command line)
cluster -f NEW_ALL_matrix_varnorm.tsv -m a -g 2
(-m a specified "average linkage", -g 2 specifies use of Pearson correlation to cluster genes)
  • Unsupervised k-means clustering (command line)
cluster -f NEW_ALL_matrix_varnorm.tsv -k 10 -g 2 -e 2 -r 1 
(-k 10 = 10 clusters, -r 1 = 1 random initialization, -e 2 = cluster columns using Pearson)
  • check gene in Wikipedia
 type gene name and wiki in Google
 http://en.wikipedia.org/wiki/NFKB1
  • translate probe sets into gene names
translate_column_using_table_column.pl --table=NEW_ALL_matrix_varnorm_K_G10.kgg --col=0 --dict=../2006-10-26_Human_60mer_1in2.ngd.HSAP_ORF --k=0 --v=1 | expression_retain_unique_lines.pl - > NEW_ALL_matrix_varnorm_K_G10.kgg.ORF.txt
  • pathway analysis (GO)
page.pl --expfile=NEW_ALL_matrix_varnorm_K_G10.kgg.ORF.txt --pathways=human_go_orf

  • pathwaY analysis (KEGG
page.pl --expfile=NEW_ALL_matrix_varnorm_K_G10.kgg.ORF.txt --pathways=kegg --suffix=KEGG
  • extract genes in given pathway and cluster
find_genes_in_bin_and_pathway.pl --expfile=NEW_ALL_matrix_varnorm_K_G10.kgg.ORF.txt.KEGG --pathway=hsa04662 --bin=6
PIK3CD	NM_005026	phosphoinositide-3-kinase, catalytic, delta	6
INPP5D	NM_001017915/NM_005541	SH2 containing inositol phosphatase isoform a/SH2 containing inositol phosphatase isoform b	6
GSK3B	NM_002093	glycogen synthase kinase 3 beta	6
  • Open a new command file:
Source ()
setwd("/Users/thomasclozel/Desktop/Epigenetics /Arrays/GE ARRAYS/outputs GE cell lines corr/")
m <- read.table("NEW_ALL_matrix.tsv", header=TRUE, row.names=1)
  • Analyse in the matrix 1 the gene FOXP1 (probeset number found in the text file) in creating this vector:
m["HSAP0406S00007126",]
  • Draw the vertical barplot to see the distribution after precision that it is a vector, sorted or not:
barplot(as.numeric(m["HSAP0406S00007126",]))
barplot(sort(as.numeric(m["HSAP0406S00007126",]))
colnames(m["HSAP0406S00007126",])
cn <- colnames(m["HSAP0406S00007126",])
barplot(as.numeric(m["HSAP0406S00007126",]), names.arg=cn)
  • Create a matrix Fox P1 and cut it into two groups:
foxp1 <- m["HSAP0406S00007126",]
foxp1[1:5]
foxp1[6:10]
  • Compare the distribution of the two groups for this gene with a t-test:
 t.test(foxp1[1:5], foxp1[6:10])
  • Test the correlations between two variables of the matrix (here: rows=probests) and make a Peasron correlation test:
cor(as.numeric(m[2,]), as.numeric(m[34,]))
cor.test(as.numeric(m[2,]), as.numeric(m[34,]))
  • Create a matrix with the ten sensible and resistant cell lines, put the names of each variables btw "", if not each variable is considered as a vector; use c to bind variables (rows or columns) and not bind, reserved for vectors; check for the column names:
SENRES <- m[,c("", "DOHH2","NUDUL1","RC.K8","SUDHL.7","SUDHL.8")]

colnames(m)

  • Analyse FOXP1 in this matrix, and compare the distribution between the two groups:
FOXP1SENRES<-SENRES[HSAP0406S00007126,]
t.test(FOXP1SENRES[1:5], FOXP1SENRES[6:10])
  • Analyse NFKB pathway with corresponding probesets:
NFKBSENRES<- SENRES[c("HSAP0406S00012216","HSAP0406S00012217","HSAP0406S00016182","HSAP0406S00018946","HSAP0406S00024378","HSAP0406S00032100" ,"HSAP0406S00032101","HSAP0406S00032102","HSAP0406S00032228","HSAP0406S00032229"), ]
t.test(NFKBSENRES[1:5], NFKBSENERZ[6:10])


  • Apply the t-test to the whole matrix SENRES, 1 for columns, 2 for rows, and create a new vector pp:
my.t.test <- function(x) { t.test(x[1:5], x[6:10])$p.value }
pp <- apply(SENRES, 1, my.t.test)
  • Study for a alpha-risk of 5% the values in creating a new numeric vector (no name) who will select the values <0.05 applied to pp:
pp[pp<0.05]
summary (pp)
  • Create a FALSE/TRUE vector pft non applied to pp:
pft <- pp<0.05
length (pft)
  • Write the vector, all the FALSE values disappear by themselves:
pft
  • Find which valors are true in combining the numerical and the TRUE vector
pp[pft]
  • The alpha risk of 5% does not work for length (pp) vector, a correction is essential. The first is to divide alpha by the number of tests realized: 0,05/ (length(pp))pp<(0.05/36487)

summary (pp<(0.05/36487))

pft<- pp<(0.05/36487)
pp[pft]
  • the second is the Benjamini-Hochberg correction for a percentage of False positive chosen between 10% or more.
pv <- p.adjust(pp, method="BH")
pv
pv<0.1
summary(pv<0.1)
Personal tools