Elementolab/R tutorial

From Icbwiki

(Difference between revisions)
Jump to: navigation, search
Revision as of 15:31, 26 March 2012
Ole2001 (Talk | contribs)
(Statistical tests, LIMMA, etc)
← Previous diff
Revision as of 15:33, 19 April 2012
Ole2001 (Talk | contribs)
(R tutorials)
Next diff →
Line 588: Line 588:
http://bioinf.wehi.edu.au/limma/ http://bioinf.wehi.edu.au/limma/
 +
 +* R and GSEA
 +
 + http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/R-GSEA_Readme

Revision as of 15:33, 19 April 2012

Contents

Download, build, call R

  • Downloading R
http://www.r-project.org/
  • Calling R from PERL
http://search.cpan.org/~gmpassos/Statistics-R-0.02/lib/Statistics/R.pm
  • To build R from source under Mac OS X
1. Download R source code
2. Download mandatory compilers (GNU C++ and Fortran)
   http://r.research.att.com/tools/
3. Unpack your R source code
   tar xvfz R-2.10.1.tar.gz
4. Configure, make and install. 
   NOTE: you have to specify the -arch option for F77 and FC otherwise configure won't know the architecture and won't work. 
cd R-2.10.1
r_arch=`arch`       
R_TGDIR=/path/to/your/destination/dir/
./configure --prefix=$R_TGDIR r_arch=$arch CC="gcc -arch $arch"   \
       CXX="g++ -arch $arch" F77="gfortran -arch $arch"           \
       FC="gfortran -arch $arch" OBJC="gcc -arch $arch"           \
       --x-includes=/usr/X11/include --x-libraries=/usr/X11/lib   \
       --with-blas='-framework vecLib' --with-lapack
make
make install
* details: http://cran.r-project.org/bin/macosx/RMacOSX-FAQ.html#Building-R-from-sources
  • To make the standalone RMath library, make sure you finished building R using the above steps, and then:
cd R-2.10.1/src/nmath/standalone
make 
make install
  • Installing R packages/libraries locally (eg on a Unix server)
mkdir $HOME/R/library
export R_LIBS_USER=$HOME/R/library
wget http://cran.r-project.org/src/contrib/SparseM_0.86.tar.gz
R CMD INSTALL SparseM_0.86.tar.gz  

Vectors and matrices

  • vector creation
v <- c(7,5,2,9)
v <- scan()
  • vector manipulation
sqrt(v)     # applies sqrt to all elements in vector
v[1]        # returns element 1
v[1:3]      # returns element 1 to 3
v[c(1,4)]   # returns elements 1 and 4 only
v[-3]       # returns all elements except 3
  • sorting
sort(v)     # sorts the content of the vector
order(v)    # sorts the vector, then returns the position of each element in the original vector
  • conditional operators
v<2         # shows which elements of v are less than 2
v==9        # shows which elements are equal to 9
v<2 & v>0   # shows which elements are between 0 and 2 (non-included) 
v[v<2]      # shows the elements which are less than 2
  • Shuffling a vector
sample(mat[1,])
  • a simple function to quantize a vector using equally populated bins
quantize <- function(x, bins=10) {
  mu <- 100 / bins
  binints <- quantile(x,  probs = mu*c(1:bins)/100)
  binints[bins] <- binints[bins] + 1
  n <- length(x)
  o <- rep(-1, n)
  for (i in 1:n) {    
    for (k in 1:bins) {
      if (x[i] <= binints[k]) {
        o[i] <- k
        break
      }
    }    
  }
  o
}
  • matrix input from disk
mat <- read.csv("testmat.txt", sep="\t", row.names=1, header=T) # reads in a TSV formatted table, with column and row names
# use check.names = FALSE to keep your headers unchanged
  • save a matrix to disk
write.table(mat, file="file.txt", sep="\t", quote=F, row.names=T, col.names=NA)
# use format(mat, digits=2, trim=T) in order to restrict the floating point precision of the output matrix
  • basic matrix manipulation
dim(mat) # returns dimension of matrix (n x m, where n is the number of rows)
mat[1,4]  # returns entry 1,4 of mat
mat[5,]  # returns the 5th row of mat
mat[,10] # returns the 10th column of mat
t(mat) # transpose mat
  • merge two vectors/matrices/data.frames by rownames
ME <- cbind(mat1, mat2[row.names(mat1), ]) #missing values will be labeled "NA"
# use MEnoNA <- ME[ !is.na(ME[,2]), ] to remove NAs
  • Eisen heatmap visualization of a matrix
(For users on Windows (win32), please be sure that you have installed the made4 package; instructions.)
library(made4)  # do once, at the beginning of a session 
heatplot(as.matrix(mat), Rowv=NA, Colv=NA)
  • apply a function to all rows or all cols of a matrix
apply(mat, 1, mean)  # returns a vector containing the mean of each row
apply(mat, 1, scale) # variance-normalize rows of a matrix
apply(mat, 2, mean)  # return mean of each column

Clustering

  • hierarchical clustering
 d <- dist(mat)        # calculate a distance matrix (stored in d)
 d <- as.dist(mat)     # convert a matrix (mat) to the distance matrix format (stored in d)
 h <- hclust(d, method="complete")   # build tree based on distance matrix
 plot(h)               # plot tree (note that the terminal branch have meaningless lengths)
 plot(h, hang=-1)      # plot tree (gives the terminal branches the correct length)
 str(as.dendrogram(h)) # output text tree with distances

More: statistical evaluation of hierarchical clustering - http://www.is.titech.ac.jp/~shimo/prog/pvclust/ - http://www.statmethods.net/advstats/cluster.html

  • k-means
 km <- kmeans(mat, 3)  # k-means with 3 clusters; stores the results in variable km
 km <- kmeans(mat, 3, nstart=10, iter.max=30); # same with additional parameters : number of runs, maximum number of iterations, etc.
 mat[km$cluster==1,] # select the first cluster
  • filter genes based on variance
s <- apply(mat, 1, sd)   # get std dev
summary(s>0.5)           # see how many genes have standard deviation greater than 0.5
vmat <- mat[s>0.5,]      # keep only genes that have variation > 0.5

Plots

  • how to do a boxplot
boxplotmat <- read.csv("boxplotmat.txt", sep="\t", row.names=1, header=T, check.names=FALSE) # this matrix should have the longest column 1st

summary(boxplotmat)   # allows you to see the contents of your matrix 
pc1 <- boxplotmat[,1] # to save the first column of your matrix. This will allow you to select which columns you wish to include/exclude from the boxplot

boxplot(pc1,pc2...,pc8, outline = FALSE, names= c("\pc1 label\", "\pc2 label\",...,"\pc8 label\"),main="\boxplot title\", col=c("red","green"))
        
        #outline=FALSE will omit drawing the outliers (to include outliers use option \=TRUE\)
        #make sure that the labels specified in names= c("\pc1 label\", "\pc2 label\") match the order (and number) of data vectors plotted (/pc1,pc2/ in this case)
        #notch=TRUE draws a notch in each side of the boxes. If the notches of two plots do not overlap this is ‘strong evidence’ that the two medians differ(Chambers et al,1983,p62)

  • how to do a basic plot
plot(x = row.names(mat),y = mat[,1], xlab = "\x-axis label\", ylab = "\y-axis label\", type = "p", main = "\title of graph\") 
#type = "h" (histogram),"p" (points), "s" (stepwise)
  • plot a profile with error bars
plot.errorbars <- function(a,s) {
 m <- length(a)
 plot(1:m, a, type="l")
 segments(1:m, a, 1:m, a+s)
 segments(1:m, a, 1:m, a-s)
 segments(c(1:m)-0.2, a+s, c(1:m)+0.2, a+s)
 segments(c(1:m)-0.2, a-s, c(1:m)+0.2, a-s)
}
# assume that as is a vector of average values, st is a vector of standard deviations
plot.errorbars(av, st)
  • sort the genes according to the clustering, then draw a heat map
 km <- kmeans(vmat, 10)  # cluster
 vmat_sorted <- mat[order(km$cluster), ] 
 heatplot(vmat_sorted, Rowv=NA, Colv=NA)
  • barplot()
use las=2 to show labels vertically 
  • multiple plots and legend
#create extra margin room on the right for the legend 
par(oma = c(0, 0, 0, 5))
#plot all the data together
plot(colMeans(cluster1), type="l", col=2, xlab="TSS", ylab="Avg. density", main="Avg. density per cluster")
points(colMeans(cluster2), type="l", col=3)
points(colMeans(cluster3), type="l", col=4)
points(colMeans(cluster4), type="l", col=5)
points(colMeans(cluster5), type="l", col=6)
points(colMeans(cluster6), type="l", col=1)
#allow the legend to print outside the plot area
par(xpd=NA)
#show the legend
legend("topright", inset = c(-.28, .0), c("cluster1", "cluster2", "cluster3", "cluster4", "cluster5", "cluster6"), col=c(2,3,4,5,6,1), lty=1)

Statistical tests, LIMMA, etc

  • finding differentially expressed genes using the t-test
t.test(v1, v2)   # does a two-tailed t-test
t.test(v1, v2, alternative="greater")  
t.test(v1, v2)$p.value # return only p-value for t-test
  • apply t-test to all rows of a matrix
my.t.test <- function(x) { t.test(x[1:14], x[15:27])$p.value }
my.t.test(a[2,])     # equivalent to t.test()
p <- t(apply(a, 1, my.t.test))
  • To correct for multiple testing
 pa <- p.adjust(p, method="BH")
  • To find all rows with p-values such that FDR=5%
 a[pa<0.05,]
 
  • In fact, it can all be done in one line, without function naming:
p <- apply(a, 1, function(x) { t.test(x[1:14], x[15:27])$p.value } )
as.matrix(p)  # because apply produces a row and not a column. t(p) would work too

Paired version, returns t-statistic and p-value

pt <- apply(a, 1, function(x) { tt<-t.test(x[1:14], x[15:27], paired=T); c(tt$statistic, tt$p.value) } )
pt <- t(pt)
a[ pt[,2]<0.05, ]  # subset of a with t-test pvalue < 0.05
  • sort rows by p-value
heatplot(sclero[ order(p, decreasing=T), ], Rowv=NA, Colv=NA)
  • analysis of variance
measures <- c( .. )  # vector with 12 elements
groups <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3) # define group membership for each measure
myaov <- aov( measures ~ as.factor(groups) )
summary( myaov )
anova(myaov)$'Pr(>F)'

  • Calculating the correlation between two random variables
cor.test() 
cor(mat[1,], mat[,2])                    #can also plot using: plot(mat[,1], mat[,2])
cor(mat[1,], mat[,2], method="spearman") #Spearman's rank correlation rho
cor(mat[1,], mat[,2], method="kendall")  #Kendall's rank correlation tau
 
pcor.test(x, y, z);   # calc cor(x,y) given z
   estimate      p.value statistic    n gn  Method            Use
1 -0.1424801 2.843382e-09 -5.940394 1706  1 Pearson Var-Cov matrix


  • Fisher's test
fisher.test(matrix( c(28, 16, 4 , 1), nrow=2, ncol=2))
  • Estimate sample size requirements for a study:
SampleSizeProportions-package Package documentation

propdiff.freq(len, p1.estimate, p2.estimate, level = 0.95)     #len= length of the confidence interval desired 
                                                               #p1.estimate= A point estimate for the binomial proportion for the first population
                                                               #p2.estimate= A point estimate for the binomial proportion for the second population
                                                               #level= confidence level desired (len=0.95 typically)
                                                               #Read more:[http://www.ehow.com/how_5212437_calculate-sample-size-proportion.html#ixzz1IaMaQlXH How to calculate sample
                                                                for Proportion]
  • Differential expression using LIMMA

Assume we have 3 WT samples and 3 unpaired mutant samples (labelled dd) and you did RNA-seq on these samples.

GENE	Description	WT1	WT2	WT3	dd1	dd2	dd3
Akirin1	akirin-1	56.18	61.31	89.48	53.54	63.38	91.83
Akirin2	akirin-2	254.91	243.65	244.12	274.66	233.75	272.08
Akna	AT-hook-containing transcription factor	10.29	11.50	16.27	14.04	14.94	11.02
Aknad1	AKNA domain containing 1	0.00	0.00	0.00	0.00	0.00	0.00
Akp3	intestinal-type alkaline phosphatase precursor	0.00	0.00	0.00	0.00	0.00	0.00
Akr1a4	alcohol dehydrogenase [NADP+]	1197.67	1354.26	1264.58	1260.24	1361.56	1393.50
...

How do we determine which genes are differentially expressed using LIMMA ?

1. create a file called design.txt that will contaim the LIMMA design matrix (tab separated columns).

S       W       WTvsdd  
WT1     1       1
WT2     1       1 
WT3     1       1
dd1     1       0
dd2     1       0
dd3     1       0

2. use LIMMA

library(limma)
design <- read.table("design.txt", header=T, row.names=1)
esettmp <- read.csv("Table_genes.txt", header=T, row.names=1, check.names=F, sep="\t")
eset <- esettmp[,-1]
varfilter <- apply(eset, 1, sd)>0.5
eset <- eset[varfilter,]
esettmp <- esettmp[varfilter,]
leset <- log2(eset+1)
fit <- lmFit(leset, design)
fit <- eBayes(fit)
oo <- topTable(fit, coef="WTvsdd", adjust="BH", number=40000)
write.table(cbind(esettmp[oo[,1],], oo), file="Table_genes_LIMMA_log2.txt", sep="\t", quote=FALSE, col.names=NA)

Principal Component Analysis

  • Principal component analysis (PCA) and biplot
p <- princomp(m, cor=T)
biplot(p, pc.biplot=T, cex=0.7, expand=3)
  • The loadings (arrows) don't always need to be plotted, in which case, one has to revert to a monoplot
p<-prcomp(t(sm), scale=T, cor=T)
lam <- sqrt(NROW(p$x)) * p$sdev[1:2]
plot(p$x[,1:2]/lam, pch=".", col="white", xlim=c(-0.75,0.75), ylim=c(-0.75, 0.75))
text(p$x[,1:2]/lam, row.names(p$x), pos=NULL)
  • More info: the prcomp function returns an object with 3 slots
p$sdev       # contains the square root of the eigenvalues
p$rotations  # contains the eigenvectors 
p$x          # contains the projected data
  • PCA using ADE4 module
https://stat.ethz.ch/pipermail/bioconductor/2003-November/003132.html
  • 3D PCA using rgl
library(rgl)
pca <- princomp(mmir, cor=T)
plot3d(pca$scores [,1:3],xlab="Component 1",main="My 3D PCA",ylab="Component 2", zlab="Component 3", box=F, axes=T) 
spheres3d(pca$scores, radius=0.5, col="red")
text3d(pca$scores[,1:3], text=rownames(pca$scores), adj=1.3, cex=0.8)  # adj controls how close the text is to the points, cex controls text size

Multi-dimensional Scaling

  • multi-dimensional scaling (MDS)
library(MASS)
im <- isoMDS(as.dist(1-cor(t(m))), maxit=1000)
eqscplot(im$points, tol=0.3, pch="+")
text(im$points, row.names(t(m)))

  • Sammon's mapping is another type of MDS:
im <- sammon(as.dist(1-cor(t(m))))

Regression and cross validation

fit <- glm(z ~ x + y,  family=binomial(link="logit"))
predict(fit, data.frame(x=mean(x), y=1), type='response')

more info http://www.stat.nthu.edu.tw/~swcheng/Teaching/stat5230/lab/02_BinomialData_EstimationProblem.html

  • To evaluate the contribution of individual predictors:
drop1(fit, test="Chi")
  • LASSO regression
library(lars)
# fits
TP4.lars <- lars( m, TP4, type="lasso")
# determine s by cross-val
TP4.cv <- cv.lars( m, TP4, type="lasso")
TP4.s  <- TP4.cv$fraction[order(TP4.cv$cv)[1]]
# evaluate fit using same data
cor.test ( predict(TP4.lars, m, type="fit", mode="fraction", s=TP4.s)$fit, TP4)
# print coefficients
coef(TP4.lars, mode="fraction", s=TP4.s)
  • Elastic Net regression (e=response, m=predictors)
 library(lars)
 library(elasticnet)
 lambda <- 0.5
 e.cv <- cv.enet( m, e, lambda, s=seq(0,1,length=100), mode="fraction", trace=F, K=10)
 s <- e.cv$s[order(e.cv$cv)[1]]
 e.fits <- enet( mnona, enona, lambda=lambda)
 # regression coefficients
 as.matrix(predict(e.fits, s=s, type="coefficients", mode="fraction")$coefficients)
 # predict and evaluate fit
 cor.test (predict(e.fits, m, type="fit", mode="fraction", s=s)$fit, e)	
  • faster LASSO/ElasticNet regression using the very fast glmnet library
 library(glmnet)
 e.cv  <- cv.glmnet( m, e, nfolds=10)
 e.l   <- e.cv$lambda.min
 alpha <- 0.5 # 1.0 = LASSO, use only >=0.01
 # fit
 e.fits <- glmnet( m, e, family="gaussian", alpha=alpha, nlambda=100)
 # predict and compare to training data
 cor.test( predict(e.fits, m, type="response", s=e.l), e)$estimate
 # get coefficients
 predict(e.fits, s=e.l, type="coefficients")
  • low-level lm()
lm.fit(x, y)$coefficients
  • cross-validation
all.folds <- cv.folds(100,10)  # generate cross-val partition
all.folds[[i]]                 # get i-th partition

Mixture model fitting

 library(mclust)
 mm <- Mclust(v, G=2)
 
 k <- mm$parameters$pro
 a <- mm$parameters$mean
 s <- sqrt( mm$parameters$variance$sigmasq )
 
 plot(density(v,bw=0.2), ylim=c(0,0.5), xlim=c(-5,5), xlab="Peak intensity", ylab="Frequency", main="",lwd=4)
 par(new=T)
 plot(function(x) k[1]*dnorm(x, mean=a[1], sd=s[1]) + k[2]*dnorm(x, mean=a[2], sd=s[2]), -5, 5, ylim=c(0,0.5), col="red", xlab="", ylab="", lwd=4)

SVM

library(e1071)
summary(svm(as.factor(GROUP) ~ mir328 + mir222 + mir21 + mir197  ,   data=dm, cross=27,   cost=100, gamma=0.25, kernel="linear"))
svm(as.factor(GROUP) ~ mir328 + mir222 + mir21 + mir197  ,   data=dm,   cost=10, gamma=0.25, kernel="linear")
table(true=m[,1], predicted=predict(osvm4, mmir))
Commonly used machine learning packages in R http://cran.r-project.org/web/views/MachineLearning.html
The tune.svm function is very useful to explore SVM parameters http://rss.acs.unt.edu/Rdoc/library/e1071/html/tune.control.html

Non-negative matrix factorization

(http://nmf.r-forge.r-project.org/)

Load matrix

LY1 <- read.table("MatPromBind.LY1.txt", header = TRUE, row.names=1)
LY1.mat <- data.matrix(LY1)

Estimates cophenetic and other metrics

estim.LY1 <- nmfEstimateRank(LY1.mat, range=2:6, nrun=40)
plot(estim.LY1)

Run nmf with seed="random" and method="brunet". Also available seed=none/ica/nndsvd, method=lee/offset/nsnmf/penmf/als

res <- nmf(LY1.mat, 4)
metaHeatmap(res)

When seed="random", run nmf multiple times (nrun) to create the consensus matrix.

res.multirun <- nmf(LY1.mat, 4, nrun=100)
metaHeatmap(res.multirun)

Get the matrix of the basis components (similar to principal components in PCA)

res.basis <- basis(res.basis)

Extract the features that are the most specific to each basis vector

res.feat <- extractFeatures(res, method="max")
n <- lapply(res.feat, function(i) rownames(res)[i])

Microarray analysis

  • load CEL files in R and calculate calls, expression summary
Data<-ReadAffy(filenames=c("test.CEL"))     
pData(Data)
Calls<-mas5calls(Data)
calls<-Calls@exprs
# below follows Halfon's Genome Biology' paper
data_exp<-expresso(Data, bgcorrect.method="mas", normalize.method="loess", pmcorrect.method="mas", summary.method="medianpolish", summary.param=list(na.rm=TRUE))
exp=Data_exp@exprs
exp_exp<-exp(exp) # only if you want the raw values and not the logs
  • Simple quantile normalization
m <- read.csv("datamatrix.txt", sep="\t", row.names=1, header=T, check.names = FALSE)
m <- log2(m+1)  # optional .. do not use if data already log-transformed
library(limma)
mn <- normalizeQuantiles(m)


  • make color map for Olivier's Heatmap drawing module
ramp <- colorRamp(c("yellow", "black", rgb(64/255,0/255,128/255) ))
cm   <- ramp(seq(0, 1, length = 100))/255
write.table(cm, file="cmap-methylation.txt", sep="\t", quote=F, row.names=F, col.names=F)
  • speed

Reading large tables into R

1. Read in a few lines to get the table column classes
2. Get number of rows in the table using "wc" system command
3. Read in the entire table with nrows and column classes specified

E.g. 
fname= "datatable.txt"
tab5rows <- read.table(fname, header = TRUE, nrows = 5)
classes <- sapply(tab5rows, class)
nlines= system(paste("wc -l ", fname), intern= TRUE)
tabAll <- read.table(fname, header = TRUE, colClasses = classes, nrows= nlines)

more details: http://www.biostat.jhsph.edu/~rpeng/docs/R-large-tables.html

Another way (reads double-filled matrix with header and row names):

filename      <- "NCI-60-HG133Plus-RMA.txt.t.sync"
cntf          <- count.fields(filename, sep="\t")
numcols       <- cntf[1]
colcls        <- c(list(""), as.list(0.1*1:(numcols-1)))
mm            <- scan(file=filename, sep="\t", skip=1, what=colcls)
mmc           <- mm[2:numcols]
co            <- do.call(cbind, mmc)
rownames(co)  <- mm[[1]]
h             <- scan(file=filename, nmax=numcols, what="", sep="\t")
h             <- h[-1]
colnames(co)  <- h

Survival analysis

Survival analysis using penalized Cox regression (assumes gene expression is in m2, m1 contains time and status)

library(glmnet)
library(survival)
a <- 0.1
cv.fit <- cv.glmnet(m2, Surv(m1$TIME, m1$DEATH), family="cox", maxit=1000, alpha=a)
fit <- glmnet(m2, Surv(m1$TIME, m1$DEATH), family="cox", maxit=1000, alpha=a)
Coefficients<-coef(fit,s=cv.fit$lambda.min) 
Active.Index<-which(Coefficients!=0) 
Active.Coefficients<-Coefficients[Active.Index] 
colnames(m2)[Active.Index]

Univariate Cox analysis of non-zero coefficients

for (f in colnames(m2)[Active.Index]) { print(f); print(summary(coxph(Surv(m1$TIME, m1$DEATH) ~ m2[,f]))$coefficients) }

Simple KM plot

plot(survfit(Surv(m1$TIME, m1$DEATH) ~ 1))


Make an R package vignette

  • Write a Sweave file (latex and S)
  • Run in R
Sweave("IRanges_rtracklayer.Snw")

This will produce the file IRanges_rtracklayer.tex, that can be easily converted to pdf

R CMD pdflatex IRanges_rtracklayer.tex 
  • To produce the R chunks run in R:
Stangle("IRanges_rtracklayer.Snw")

This will produce the file IRanges_rtracklayer.R, that can contains all commands in R

  • R Sweave does not like undescores "_" in the name of the Sweave file.

Especially if you produce plots within the .Snw file (using plot, boxplot, etc), the .tex file produced will not understand the underscore and will fail to use the PDF or EPS figure's path.

Latex will then complain that it cannot find the figures.

This is caused by the underscore "_" that the .Snw filename has.

R tutorials

  • Online tutorial

Heatmaps: http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/r/heatmap/

A blog about R: http://statisticsr.blogspot.com/

Computational Statistics course: http://stat.ethz.ch/teaching/lectures/SS_2006/CompStat

  • other useful packages
http://bioinf.wehi.edu.au/limma/
  • R and GSEA
http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/R-GSEA_Readme
Personal tools