Elementolab/R tutorial

Download, build, call R

  • Downloading R
http://www.r-project.org/
  • Calling R from Perl
http://search.cpan.org/~gmpassos/Statistics-R-0.02/lib/Statistics/R.pm
  • To build R from source under Mac OS X
1. Download R source code
2. Download mandatory compilers (GNU C++ and Fortran)
   http://r.research.att.com/tools/
3. Unpack your R source code
   tar xvfz R-2.10.1.tar.gz
4. Configure, make and install. 
   NOTE: you have to specify the -arch option for F77 and FC, otherwise configure won't know the architecture and will fail. 
cd R-2.10.1
arch=`arch`         # shell variable used by the configure call below
R_TGDIR=/path/to/your/destination/dir/
./configure --prefix=$R_TGDIR r_arch=$arch CC="gcc -arch $arch"   \
       CXX="g++ -arch $arch" F77="gfortran -arch $arch"           \
       FC="gfortran -arch $arch" OBJC="gcc -arch $arch"           \
       --x-includes=/usr/X11/include --x-libraries=/usr/X11/lib   \
       --with-blas='-framework vecLib' --with-lapack
make
make install
   details: http://cran.r-project.org/bin/macosx/RMacOSX-FAQ.html#Building-R-from-sources
  • To build the standalone Rmath library, first finish building R using the above steps, then:
cd R-2.10.1/src/nmath/standalone
make 
make install
  • Installing R packages/libraries locally (e.g. on a Unix server)
mkdir $HOME/R/library
export R_LIBS_USER=$HOME/R/library
wget http://cran.r-project.org/src/contrib/SparseM_0.86.tar.gz
R CMD INSTALL SparseM_0.86.tar.gz  

Vectors and matrices

  • vector creation
v <- c(7,5,2,9)
v <- scan()
  • vector manipulation
sqrt(v)     # applies sqrt to all elements in vector
v[1]        # returns element 1
v[1:3]      # returns elements 1 to 3
v[c(1,4)]   # returns elements 1 and 4 only
v[-3]       # returns all elements except element 3
  • sorting
sort(v)     # sorts the content of the vector
order(v)    # returns the indices that would sort the vector, i.e. v[order(v)] gives the same result as sort(v)
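For example:
v <- c(7,5,2,9)
order(v)     # 3 2 1 4 : the smallest value sits at position 3, the next at position 2, ...
v[order(v)]  # 2 5 7 9, identical to sort(v)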
  • conditional operators
v<2         # shows which elements of v are less than 2
v==9        # shows which elements are equal to 9
v<2 & v>0   # shows which elements are strictly between 0 and 2
v[v<2]      # shows the elements which are less than 2
  • Shuffling a vector
sample(v)        # returns a random permutation of v (e.g. sample(mat[1,]) shuffles the first row of a matrix)
  • a simple function to quantize a vector using equally populated bins
quantize <- function(x, bins=10) {
  mu <- 100 / bins
  # upper boundary of each bin, taken at equally spaced quantiles
  binints <- quantile(x,  probs = mu*c(1:bins)/100)
  # nudge the last boundary up so the maximum falls inside the last bin
  binints[bins] <- binints[bins] + 1
  n <- length(x)
  o <- rep(-1, n)
  for (i in 1:n) {    
    # assign each value the index of the first bin whose boundary it does not exceed
    for (k in 1:bins) {
      if (x[i] <= binints[k]) {
        o[i] <- k
        break
      }
    }    
  }
  o
}
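A quick sanity check of the function (bin counts are approximate):
x <- rnorm(1000)
q <- quantize(x, bins=4)
table(q)   # each of the 4 bins should hold roughly 250 values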
  • matrix input from disk
mat <- read.csv("testmat.txt", sep="\t", row.names=1, header=T) # reads in a TSV formatted table, with column and row names
# use check.names = FALSE to keep your headers unchanged
  • save a matrix to disk
write.table(mat, file="file.txt", sep="\t", quote=F, row.names=T, col.names=NA)
# use format(mat, digits=2, trim=T) in order to restrict the floating point precision of the output matrix
  • basic matrix manipulation
dim(mat) # returns dimension of matrix (n x m, where n is the number of rows)
mat[1,4]  # returns entry 1,4 of mat
mat[5,]  # returns the 5th row of mat
mat[,10] # returns the 10th column of mat
t(mat) # transpose mat


  • Eisen heatmap visualization of a matrix
(For users on Windows (win32), please be sure that you have installed the made4 package first; it is available from Bioconductor.)
library(made4)  # do once, at the beginning of a session 
heatplot(as.matrix(mat), Rowv=NA, Colv=NA)
  • apply a function to all rows or all cols of a matrix
apply(mat, 1, mean)  # returns a vector containing the mean of each row
apply(mat, 1, scale) # variance-normalize rows of a matrix (note: the result comes back transposed, one column per original row)
apply(mat, 2, mean)  # return mean of each column

Clustering

  • hierarchical clustering
 d <- dist(mat)        # calculate a distance matrix (stored in d)
 d <- as.dist(mat)     # convert a matrix (mat) to the distance matrix format (stored in d)
 h <- hclust(d, method="complete")   # build tree based on distance matrix
 plot(h)               # plot tree (note that the terminal branches have meaningless lengths)
 plot(h, hang=-1)      # plot tree, giving the terminal branches the correct length
 str(as.dendrogram(h)) # output text tree with distances
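To turn the tree into discrete cluster assignments, cutree() cuts it at a given number of clusters (or at a given height):
 cl <- cutree(h, k=3)  # assign each row to one of 3 clusters
 mat[cl==1,]           # rows of mat in the first cluster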

More: statistical evaluation of hierarchical clustering - http://www.is.titech.ac.jp/~shimo/prog/pvclust/ - http://www.statmethods.net/advstats/cluster.html

  • k-means
 km <- kmeans(mat, 3)  # k-means with 3 clusters; stores the results in variable km
 km <- kmeans(mat, 3, nstart=10, iter.max=30) # same with additional parameters: number of random starts, maximum number of iterations
 mat[km$cluster==1,] # select the first cluster
  • filter genes based on variance
s <- apply(mat, 1, sd)   # get std dev
summary(s>0.5)           # see how many genes have standard deviation greater than 0.5
vmat <- mat[s>0.5,]      # keep only genes with standard deviation > 0.5

Plots

  • how to do a boxplot
boxplotmat <- read.csv("boxplotmat.txt", sep="\t", row.names=1, header=T, check.names=FALSE) # this matrix should have the longest column 1st

summary(boxplotmat)   # summarizes each column of your matrix 
pc1 <- boxplotmat[,1] # save the first column; repeating this for each column lets you choose which columns to include/exclude from the boxplot

boxplot(pc1, pc2, ..., pc8, outline=FALSE, names=c("pc1 label", "pc2 label", ..., "pc8 label"), main="boxplot title", col=c("red","green"))
        
        #outline=FALSE omits the outliers (use outline=TRUE to include them)
        #make sure that the labels in names= match the order (and number) of the data vectors plotted (pc1,...,pc8 in this case)
        #notch=TRUE draws a notch in each side of the boxes. If the notches of two plots do not overlap, this is 'strong evidence' that the two medians differ (Chambers et al., 1983, p. 62)

  • how to do a basic plot
plot(x = row.names(mat), y = mat[,1], xlab = "x-axis label", ylab = "y-axis label", type = "p", main = "title of graph") 
#type = "h" (histogram-like vertical lines), "p" (points), "s" (steps), "l" (lines)
  • plot a profile with error bars
plot.errorbars <- function(a,s) {
 m <- length(a)
 plot(1:m, a, type="l")
 segments(1:m, a, 1:m, a+s)
 segments(1:m, a, 1:m, a-s)
 segments(c(1:m)-0.2, a+s, c(1:m)+0.2, a+s)
 segments(c(1:m)-0.2, a-s, c(1:m)+0.2, a-s)
}
# assume that av is a vector of average values, st is a vector of standard deviations
plot.errorbars(av, st)
  • sort the genes according to the clustering, then draw a heat map
 km <- kmeans(vmat, 10)  # cluster
 vmat_sorted <- vmat[order(km$cluster), ]   # note: sort vmat (the matrix that was clustered), not mat
 heatplot(as.matrix(vmat_sorted), Rowv=NA, Colv=NA)
  • barplot()
use las=2 to draw the labels perpendicular to the axis (vertical x-axis labels); see the example below
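For example, with the matrix mat loaded earlier:
par(mar=c(8,4,2,1))            # widen the bottom margin so long vertical labels fit
barplot(colMeans(mat), las=2)  # one bar per column of mat, labels drawn vertically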
  • multiple plots and legend
#create extra margin room on the right for the legend 
par(oma = c(0, 0, 0, 5))
#plot all the data together
plot(colMeans(cluster1), type="l", col=2, xlab="TSS", ylab="Avg. density", main="Avg. density per cluster")
points(colMeans(cluster2), type="l", col=3)
points(colMeans(cluster3), type="l", col=4)
points(colMeans(cluster4), type="l", col=5)
points(colMeans(cluster5), type="l", col=6)
points(colMeans(cluster6), type="l", col=1)
#allow the legend to print outside the plot area
par(xpd=NA)
#show the legend
legend("topright", inset = c(-.28, .0), c("cluster1", "cluster2", "cluster3", "cluster4", "cluster5", "cluster6"), col=c(2,3,4,5,6,1), lty=1)

Statistical tests

  • finding differentially expressed genes using the t-test
t.test(v1, v2)   # does a two-tailed t-test
t.test(v1, v2, alternative="greater")  
t.test(v1, v2)$p.value # return only p-value for t-test
  • apply t-test to all rows of a matrix
my.t.test <- function(x) { t.test(x[1:14], x[15:27])$p.value }  # compare samples 1-14 vs 15-27
my.t.test(a[2,])     # equivalent to running t.test() on row 2
p <- t(apply(a, 1, my.t.test))
  • To correct for multiple testing
 pa <- p.adjust(p, method="BH")
  • To find all rows with p-values such that FDR=5%
 a[pa<0.05,]
 
  • In fact, it can all be done in one line, without function naming:
p <- apply(a, 1, function(x) { t.test(x[1:14], x[15:27])$p.value } )
as.matrix(p)  # because apply produces a row and not a column. t(p) would work too
  • sort rows by p-value
heatplot(sclero[ order(p, decreasing=T), ], Rowv=NA, Colv=NA)
  • analysis of variance
measures <- c( .. )  # vector with 12 elements
groups <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3) # define group membership for each measure
myaov <- aov( measures ~ as.factor(groups) )
summary( myaov )
anova(myaov)$'Pr(>F)'

  • Calculating the correlation between two random variables
cor.test(mat[,1], mat[,2])               # correlation test, returns both the estimate and a p-value
cor(mat[,1], mat[,2])                    # Pearson by default; can also plot using: plot(mat[,1], mat[,2])
cor(mat[,1], mat[,2], method="spearman") # Spearman's rank correlation rho
cor(mat[,1], mat[,2], method="kendall")  # Kendall's rank correlation tau
 
  • Fisher's test
fisher.test(matrix( c(28, 16, 4 , 1), nrow=2, ncol=2))
  • Estimate sample size requirements for a study:
library(SampleSizeProportions)   # see the SampleSizeProportions package documentation

propdiff.freq(len, p1.estimate, p2.estimate, level = 0.95)     #len= length of the confidence interval desired 
                                                               #p1.estimate= a point estimate of the binomial proportion for the first population
                                                               #p2.estimate= a point estimate of the binomial proportion for the second population
                                                               #level= confidence level desired (typically 0.95)
                                                               #Read more: http://www.ehow.com/how_5212437_calculate-sample-size-proportion.html

Principal Component Analysis

  • Principal component analysis (PCA) and biplot
p <- princomp(m, cor=T)
biplot(p, pc.biplot=T, cex=0.7, expand=3)
  • The loadings (arrows) don't always need to be plotted, in which case, one has to revert to a monoplot
p <- prcomp(t(sm), scale.=T)   # note: prcomp takes scale. (cor is a princomp argument)
lam <- sqrt(NROW(p$x)) * p$sdev[1:2]
plot(p$x[,1:2]/lam, pch=".", col="white", xlim=c(-0.75,0.75), ylim=c(-0.75, 0.75))
text(p$x[,1:2]/lam, row.names(p$x), pos=NULL)
  • More info: the prcomp function returns an object whose main components are
p$sdev       # contains the square roots of the eigenvalues
p$rotation   # contains the eigenvectors (loadings)
p$x          # contains the projected data
  • PCA using the ade4 package
https://stat.ethz.ch/pipermail/bioconductor/2003-November/003132.html
  • 3D PCA using rgl
library(rgl)
pca <- princomp(mmir, cor=T)
plot3d(pca$scores [,1:3],xlab="Component 1",main="My 3D PCA",ylab="Component 2", zlab="Component 3", box=F, axes=T) 
spheres3d(pca$scores, radius=0.5, col="red")
text3d(pca$scores[,1:3], text=rownames(pca$scores), adj=1.3, cex=0.8)  # adj controls how close the text is to the points, cex controls text size

Multi-dimensional Scaling

  • multi-dimensional scaling (MDS)
library(MASS)
im <- isoMDS(as.dist(1-cor(t(m))), maxit=1000)
eqscplot(im$points, tol=0.3, pch="+")
text(im$points, row.names(m))  # the MDS points correspond to rows of m

  • Sammon's mapping is another type of MDS:
im <- sammon(as.dist(1-cor(t(m))))

Regression and cross validation

fit <- glm(z ~ x + y,  family=binomial(link="logit"))
predict(fit, data.frame(x=mean(x), y=1), type='response')

more info http://www.stat.nthu.edu.tw/~swcheng/Teaching/stat5230/lab/02_BinomialData_EstimationProblem.html

  • To evaluate the contribution of individual predictors:
drop1(fit, test="Chi")
  • LASSO regression
library(lars)
# fits
TP4.lars <- lars( m, TP4, type="lasso")
# determine s by cross-val
TP4.cv <- cv.lars( m, TP4, type="lasso")
TP4.s  <- TP4.cv$fraction[order(TP4.cv$cv)[1]]
# evaluate fit using same data
cor.test ( predict(TP4.lars, m, type="fit", mode="fraction", s=TP4.s)$fit, TP4)
# print coefficients
coef(TP4.lars, mode="fraction", s=TP4.s)
  • Elastic Net regression (e=response, m=predictors)
 library(lars)
 library(elasticnet)
 lambda <- 0.5
 e.cv <- cv.enet( m, e, lambda, s=seq(0,1,length=100), mode="fraction", trace=F, K=10)
 s <- e.cv$s[order(e.cv$cv)[1]]
 e.fits <- enet( mnona, enona, lambda=lambda)  # mnona/enona: presumably m and e with missing values removed
 # regression coefficients
 as.matrix(predict(e.fits, s=s, type="coefficients", mode="fraction")$coefficients)
 # predict and evaluate fit
 cor.test (predict(e.fits, m, type="fit", mode="fraction", s=s)$fit, e)	
  • faster LASSO/ElasticNet regression using the very fast glmnet library
 library(glmnet)
 e.cv  <- cv.glmnet( m, e, nfolds=10)
 e.l   <- e.cv$lambda.min
 alpha <- 0.5 # 1.0 = LASSO, 0 = ridge; use only values >= 0.01
 # fit
 e.fits <- glmnet( m, e, family="gaussian", alpha=alpha, nlambda=100)
 # predict and compare to training data
 cor.test( predict(e.fits, m, type="response", s=e.l), e)$estimate
 # get coefficients
 predict(e.fits, s=e.l, type="coefficients")
  • low-level lm()
lm.fit(x, y)$coefficients
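Note that lm.fit() takes the design matrix as-is, so the intercept column has to be added by hand; a minimal sketch with simulated data:
x <- matrix(rnorm(200), ncol=2)        # 100 observations, 2 predictors
y <- 1 + x %*% c(2,-1) + rnorm(100)    # true intercept 1, slopes 2 and -1
lm.fit(cbind(1, x), y)$coefficients    # first coefficient is the intercept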
  • cross-validation
library(lars)                  # cv.folds comes from the lars package
all.folds <- cv.folds(100,10)  # generate cross-val partition
all.folds[[i]]                 # get i-th partition
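A sketch of how the partitions are typically consumed, with lm.fit() standing in for the model of interest (x and y simulated as in the lm.fit example above):
library(lars)   # for cv.folds
x <- matrix(rnorm(200), ncol=2); y <- 1 + x %*% c(2,-1) + rnorm(100)
all.folds <- cv.folds(100, 10)
err <- sapply(all.folds, function(test) {
  fit  <- lm.fit(cbind(1, x[-test,]), y[-test])      # train on the other 9 folds
  pred <- cbind(1, x[test,]) %*% fit$coefficients    # predict the held-out fold
  mean((y[test] - pred)^2)                           # held-out mean squared error
})
mean(err)   # cross-validated error estimate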

Mixture model fitting

 library(mclust)
 mm <- Mclust(v, G=2)
 
 k <- mm$parameters$pro
 a <- mm$parameters$mean
 s <- sqrt( mm$parameters$variance$sigmasq )
 
 plot(density(v,bw=0.2), ylim=c(0,0.5), xlim=c(-5,5), xlab="Peak intensity", ylab="Frequency", main="",lwd=4)
 par(new=T)
 plot(function(x) k[1]*dnorm(x, mean=a[1], sd=s[1]) + k[2]*dnorm(x, mean=a[2], sd=s[2]), -5, 5, ylim=c(0,0.5), col="red", xlab="", ylab="", lwd=4)

SVM

library(e1071)
summary(svm(as.factor(GROUP) ~ mir328 + mir222 + mir21 + mir197  ,   data=dm, cross=27,   cost=100, gamma=0.25, kernel="linear"))
svm(as.factor(GROUP) ~ mir328 + mir222 + mir21 + mir197  ,   data=dm,   cost=10, gamma=0.25, kernel="linear")
table(true=m[,1], predicted=predict(osvm4, mmir))  # confusion matrix: osvm4 is a previously fitted svm model, mmir its predictor matrix
Commonly used machine learning packages in R: http://cran.r-project.org/web/views/MachineLearning.html
The tune.svm function is very useful for exploring SVM parameters (see the sketch below): http://rss.acs.unt.edu/Rdoc/library/e1071/html/tune.control.html
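A sketch of such a grid search, reusing the formula and data frame from above (with a linear kernel only cost matters):
library(e1071)
tuned <- tune.svm(as.factor(GROUP) ~ mir328 + mir222 + mir21 + mir197, data=dm,
                  cost=10^(-1:3), kernel="linear")
summary(tuned)          # cross-validation performance for each cost value
tuned$best.parameters   # best cost found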

Non-negative matrix factorization

(http://nmf.r-forge.r-project.org/)

Load matrix

library(NMF)   # do once per session
LY1 <- read.table("MatPromBind.LY1.txt", header = TRUE, row.names=1)
LY1.mat <- data.matrix(LY1)

Estimate the cophenetic correlation and other rank-selection metrics

estim.LY1 <- nmfEstimateRank(LY1.mat, range=2:6, nrun=40)
plot(estim.LY1)

Run nmf with seed="random" and method="brunet". Also available: seed=none/ica/nndsvd and method=lee/offset/nsnmf/penmf/als.

res <- nmf(LY1.mat, 4, method="brunet", seed="random")
metaHeatmap(res)

When seed="random", run nmf multiple times (nrun) to create the consensus matrix.

res.multirun <- nmf(LY1.mat, 4, nrun=100)
metaHeatmap(res.multirun)

Get the matrix of the basis components (similar to principal components in PCA)

res.basis <- basis(res)

Extract the features that are the most specific to each basis vector

res.feat <- extractFeatures(res, method="max")
n <- lapply(res.feat, function(i) rownames(res)[i])

Others

  • load CEL files in R and calculate calls, expression summary
library(affy)   # provides ReadAffy, mas5calls, expresso
Data <- ReadAffy(filenames=c("test.CEL"))
pData(Data)
Calls <- mas5calls(Data)
calls <- exprs(Calls)   # exprs() is the accessor for the @exprs slot
# below follows Halfon's Genome Biology paper
data_exp <- expresso(Data, bgcorrect.method="mas", normalize.method="loess", pmcorrect.method="mas", summary.method="medianpolish", summary.param=list(na.rm=TRUE))
expvals <- exprs(data_exp)
expvals_raw <- 2^expvals   # only if you want the raw values and not the logs (the medianpolish summaries are log2)
  • Simple quantile normalization
m <- read.csv("datamatrix.txt", sep="\t", row.names=1, header=T, check.names = FALSE)
m <- log2(m+1)  # optional; do not use if the data are already log-transformed
library(limma)
mn <- normalizeQuantiles(m)



  • make color map for Olivier's Heatmap drawing module
ramp <- colorRamp(c("yellow", "black", rgb(64/255,0/255,128/255) ))
cm   <- ramp(seq(0, 1, length = 100))/255
write.table(cm, file="cmap-methylation.txt", sep="\t", quote=F, row.names=F, col.names=F)
  • speed

Reading large tables into R

1. Read in a few lines to get the table column classes
2. Get number of rows in the table using "wc" system command
3. Read in the entire table with nrows and column classes specified

E.g. 
fname <- "datatable.txt"
tab5rows <- read.table(fname, header = TRUE, nrows = 5)
classes <- sapply(tab5rows, class)
wc.out <- system(paste("wc -l", fname), intern = TRUE)     # e.g. "  1234 datatable.txt"
nlines <- as.numeric(sub("^ *([0-9]+).*", "\\1", wc.out))  # keep only the line count
tabAll <- read.table(fname, header = TRUE, colClasses = classes, nrows = nlines)

more details: http://www.biostat.jhsph.edu/~rpeng/docs/R-large-tables.html

Another way (reads a matrix of doubles with header and row names):

filename      <- "NCI-60-HG133Plus-RMA.txt.t.sync"
cntf          <- count.fields(filename, sep="\t")
numcols       <- cntf[1]
colcls        <- c(list(""), as.list(0.1*1:(numcols-1)))   # column templates: character for row names, doubles for data
mm            <- scan(file=filename, sep="\t", skip=1, what=colcls)  # skip the header line
mmc           <- mm[2:numcols]                             # drop the row-name column
co            <- do.call(cbind, mmc)                       # bind the data columns into a matrix
rownames(co)  <- mm[[1]]
h             <- scan(file=filename, nmax=numcols, what="", sep="\t")  # re-read the header line
h             <- h[-1]
colnames(co)  <- h

Survival analysis

Survival analysis using penalized Cox regression (assumes gene expression is in m2, m1 contains time and status)

library(glmnet)
library(survival)
a <- 0.1
cv.fit <- cv.glmnet(m2, Surv(m1$TIME, m1$DEATH), family="cox", maxit=1000, alpha=a)
fit <- glmnet(m2, Surv(m1$TIME, m1$DEATH), family="cox", maxit=1000, alpha=a)
Coefficients <- coef(fit, s=cv.fit$lambda.min) 
Active.Index <- which(Coefficients != 0) 
Active.Coefficients <- Coefficients[Active.Index] 
colnames(m2)[Active.Index]   # names of the genes with non-zero coefficients

Univariate Cox analysis of non-zero coefficients

for (f in colnames(m2)[Active.Index]) { print(f); print(summary(coxph(Surv(m1$TIME, m1$DEATH) ~ m2[,f]))$coefficients) }

Simple KM plot

plot(survfit(Surv(m1$TIME, m1$DEATH) ~ 1))
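To compare survival between groups instead, put a grouping variable on the right-hand side; group here is a hypothetical two-level factor (e.g. high vs. low expression of one gene):
group <- m2[,1] > median(m2[,1])   # hypothetical split on the first gene in m2
plot(survfit(Surv(m1$TIME, m1$DEATH) ~ group), col=c("black","red"))
survdiff(Surv(m1$TIME, m1$DEATH) ~ group)   # log-rank test for a difference between the curves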

Distributing R code

  • dump R code that can reconstruct the content of the dm R variable
dump("dm")   # writes to dumpdata.R by default; reload with source("dumpdata.R")
  • Rscript lets you make command line executable scripts, starting with
#! /usr/bin/Rscript --vanilla --default-packages=utils,e1071
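The rest of the script is ordinary R code; command-line arguments are read with commandArgs(). A minimal sketch:
#! /usr/bin/Rscript --vanilla
args <- commandArgs(trailingOnly=TRUE)   # arguments after the script name
cat("input file:", args[1], "\n")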

Make an R package vignette

  • Write a Sweave file (latex and R)
  • Run in R
Sweave("IRanges_rtracklayer.Snw")

This will produce the file IRanges_rtracklayer.tex, which can easily be converted to PDF:

library(tools)
texi2dvi("IRanges_rtracklayer.tex", pdf=TRUE)
  • To extract the R code chunks, run in R:
Stangle("IRanges_rtracklayer.Snw")

This will produce the file IRanges_rtracklayer.R, which contains all the R commands from the document

  • R Sweave does not like underscores "_" in the name of the Sweave file.

If you produce plots within the .Snw file (using plot, boxplot, etc.), the underscore in the .Snw filename is carried into the figure paths of the generated .tex file; LaTeX cannot resolve the PDF or EPS figure paths and will complain that it cannot find the figures. Avoid underscores in Sweave filenames.

R tutorials

  • Online tutorial

Heatmaps: http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/r/heatmap/

A blog about R: http://statisticsr.blogspot.com/

Computational Statistics course: http://stat.ethz.ch/teaching/lectures/SS_2006/CompStat

  • other useful packages
http://bioinf.wehi.edu.au/limma/
  • R and GSEA
http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/R-GSEA_Readme