Elementolab/R tutorial
From Icbwiki
| Revision as of 15:31, 26 March 2012 Ole2001 (Talk | contribs) (→Statistical tests, LIMMA, etc) ← Previous diff |
Revision as of 15:33, 19 April 2012 Ole2001 (Talk | contribs) (→R tutorials) Next diff → |
||
| Line 588: | Line 588: | ||
| http://bioinf.wehi.edu.au/limma/ | http://bioinf.wehi.edu.au/limma/ | ||
| + | |||
| + | * R and GSEA | ||
| + | |||
| + | http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/R-GSEA_Readme | ||
Revision as of 15:33, 19 April 2012
Download, build, call R
- Downloading R
http://www.r-project.org/
- Calling R from PERL
http://search.cpan.org/~gmpassos/Statistics-R-0.02/lib/Statistics/R.pm
- To build R from source under Mac OS X
1. Download R source code 2. Download mandatory compilers (GNU C++ and Fortran) http://r.research.att.com/tools/ 3. Unpack your R source code tar xvfz R-2.10.1.tar.gz 4. Configure, make and install. NOTE: you have to specify the -arch option for F77 and FC otherwise configure won't know the architecture and won't work. cd R-2.10.1 r_arch=`arch` R_TGDIR=/path/to/your/destination/dir/ ./configure --prefix=$R_TGDIR r_arch=$arch CC="gcc -arch $arch" \ CXX="g++ -arch $arch" F77="gfortran -arch $arch" \ FC="gfortran -arch $arch" OBJC="gcc -arch $arch" \ --x-includes=/usr/X11/include --x-libraries=/usr/X11/lib \ --with-blas='-framework vecLib' --with-lapack make make install * details: http://cran.r-project.org/bin/macosx/RMacOSX-FAQ.html#Building-R-from-sources
- To make the standalone RMath library, make sure you finished building R using the above steps, and then:
cd R-2.10.1/src/nmath/standalone make make install
- Installing R packages/libraries locally (eg on a Unix server)
mkdir $HOME/R/library export R_LIBS_USER=$HOME/R/library wget http://cran.r-project.org/src/contrib/SparseM_0.86.tar.gz R CMD INSTALL SparseM_0.86.tar.gz
Vectors and matrices
- vector creation
v <- c(7,5,2,9) v <- scan()
- vector manipulation
sqrt(v) # applies sqrt to all elements in vector v[1] # returns element 1 v[1:3] # returns element 1 to 3 v[c(1,4)] # returns elements 1 and 4 only v[-3] # returns all elements except 3
- sorting
sort(v) # sorts the content of the vector order(v) # sorts the vector, then returns the position of each element in the original vector
- conditional operators
v<2 # shows which elements of v are less than 2 v==9 # shows which elements are equal to 9 v<2 & v>0 # shows which elements are between 0 and 2 (non-included) v[v<2] # shows the elements which are less than 2
- Shuffling a vector
sample(mat[1,])
- a simple function to quantize a vector using equally populated bins
quantize <- function(x, bins=10) {
mu <- 100 / bins
binints <- quantile(x, probs = mu*c(1:bins)/100)
binints[bins] <- binints[bins] + 1
n <- length(x)
o <- rep(-1, n)
for (i in 1:n) {
for (k in 1:bins) {
if (x[i] <= binints[k]) {
o[i] <- k
break
}
}
}
o
}
- matrix input from disk
mat <- read.csv("testmat.txt", sep="\t", row.names=1, header=T) # reads in a TSV formatted table, with column and row names
# use check.names = FALSE to keep your headers unchanged
- save a matrix to disk
write.table(mat, file="file.txt", sep="\t", quote=F, row.names=T, col.names=NA) # use format(mat, digits=2, trim=T) in order to restrict the floating point precision of the output matrix
- basic matrix manipulation
dim(mat) # returns dimension of matrix (n x m, where n is the number of rows) mat[1,4] # returns entry 1,4 of mat mat[5,] # returns the 5th row of mat mat[,10] # returns the 10th column of mat t(mat) # transpose mat
- merge two vectors/matrices/data.frames by rownames
ME <- cbind(mat1, mat2[row.names(mat1), ]) #missing values will be labeled "NA" # use MEnoNA <- ME[ !is.na(ME[,2]), ] to remove NAs
- Eisen heatmap visualization of a matrix
(For users on Windows (win32), please be sure that you have installed the made4 package; instructions.) library(made4) # do once, at the beginning of a session heatplot(as.matrix(mat), Rowv=NA, Colv=NA)
- apply a function to all rows or all cols of a matrix
apply(mat, 1, mean) # returns a vector containing the mean of each row apply(mat, 1, scale) # variance-normalize rows of a matrix apply(mat, 2, mean) # return mean of each column
Clustering
- hierarchical clustering
d <- dist(mat) # calculate a distance matrix (stored in d) d <- as.dist(mat) # convert a matrix (mat) to the distance matrix format (stored in d) h <- hclust(d, method="complete") # build tree based on distance matrix plot(h) # plot tree (note that the terminal branch have meaningless lengths) plot(h, hang=-1) # plot tree (gives the terminal branches the correct length) str(as.dendrogram(h)) # output text tree with distances
More: statistical evaluation of hierarchical clustering - http://www.is.titech.ac.jp/~shimo/prog/pvclust/ - http://www.statmethods.net/advstats/cluster.html
- k-means
km <- kmeans(mat, 3) # k-means with 3 clusters; stores the results in variable km km <- kmeans(mat, 3, nstart=10, iter.max=30); # same with additional parameters : number of runs, maximum number of iterations, etc. mat[km$cluster==1,] # select the first cluster
- filter genes based on variance
s <- apply(mat, 1, sd) # get std dev summary(s>0.5) # see how many genes have standard deviation greater than 0.5 vmat <- mat[s>0.5,] # keep only genes that have variation > 0.5
Plots
- how to do a boxplot
boxplotmat <- read.csv("boxplotmat.txt", sep="\t", row.names=1, header=T, check.names=FALSE) # this matrix should have the longest column 1st
summary(boxplotmat) # allows you to see the contents of your matrix
pc1 <- boxplotmat[,1] # to save the first column of your matrix. This will allow you to select which columns you wish to include/exclude from the boxplot
boxplot(pc1,pc2...,pc8, outline = FALSE, names= c("\pc1 label\", "\pc2 label\",...,"\pc8 label\"),main="\boxplot title\", col=c("red","green"))
#outline=FALSE will omit drawing the outliers (to include outliers use option \=TRUE\)
#make sure that the labels specified in names= c("\pc1 label\", "\pc2 label\") match the order (and number) of data vectors plotted (/pc1,pc2/ in this case)
#notch=TRUE draws a notch in each side of the boxes. If the notches of two plots do not overlap this is ‘strong evidence’ that the two medians differ(Chambers et al,1983,p62)
- how to do a basic plot
plot(x = row.names(mat),y = mat[,1], xlab = "\x-axis label\", ylab = "\y-axis label\", type = "p", main = "\title of graph\") #type = "h" (histogram),"p" (points), "s" (stepwise)
- plot a profile with error bars
plot.errorbars <- function(a,s) {
m <- length(a)
plot(1:m, a, type="l")
segments(1:m, a, 1:m, a+s)
segments(1:m, a, 1:m, a-s)
segments(c(1:m)-0.2, a+s, c(1:m)+0.2, a+s)
segments(c(1:m)-0.2, a-s, c(1:m)+0.2, a-s)
}
# assume that as is a vector of average values, st is a vector of standard deviations plot.errorbars(av, st)
- sort the genes according to the clustering, then draw a heat map
km <- kmeans(vmat, 10) # cluster vmat_sorted <- mat[order(km$cluster), ] heatplot(vmat_sorted, Rowv=NA, Colv=NA)
- barplot()
use las=2 to show labels vertically
- multiple plots and legend
#create extra margin room on the right for the legend
par(oma = c(0, 0, 0, 5))
#plot all the data together
plot(colMeans(cluster1), type="l", col=2, xlab="TSS", ylab="Avg. density", main="Avg. density per cluster")
points(colMeans(cluster2), type="l", col=3)
points(colMeans(cluster3), type="l", col=4)
points(colMeans(cluster4), type="l", col=5)
points(colMeans(cluster5), type="l", col=6)
points(colMeans(cluster6), type="l", col=1)
#allow the legend to print outside the plot area
par(xpd=NA)
#show the legend
legend("topright", inset = c(-.28, .0), c("cluster1", "cluster2", "cluster3", "cluster4", "cluster5", "cluster6"), col=c(2,3,4,5,6,1), lty=1)
Statistical tests, LIMMA, etc
- finding differentially expressed genes using the t-test
t.test(v1, v2) # does a two-tailed t-test t.test(v1, v2, alternative="greater") t.test(v1, v2)$p.value # return only p-value for t-test
- apply t-test to all rows of a matrix
my.t.test <- function(x) { t.test(x[1:14], x[15:27])$p.value }
my.t.test(a[2,]) # equivalent to t.test()
p <- t(apply(a, 1, my.t.test))
- To correct for multiple testing
pa <- p.adjust(p, method="BH")
- To find all rows with p-values such that FDR=5%
a[pa<0.05,]
- In fact, it can all be done in one line, without function naming:
p <- apply(a, 1, function(x) { t.test(x[1:14], x[15:27])$p.value } )
as.matrix(p) # because apply produces a row and not a column. t(p) would work too
Paired version, returns t-statistic and p-value
pt <- apply(a, 1, function(x) { tt<-t.test(x[1:14], x[15:27], paired=T); c(tt$statistic, tt$p.value) } )
pt <- t(pt)
a[ pt[,2]<0.05, ] # subset of a with t-test pvalue < 0.05
- sort rows by p-value
heatplot(sclero[ order(p, decreasing=T), ], Rowv=NA, Colv=NA)
- analysis of variance
measures <- c( .. ) # vector with 12 elements groups <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3) # define group membership for each measure myaov <- aov( measures ~ as.factor(groups) ) summary( myaov ) anova(myaov)$'Pr(>F)'
- Calculating the correlation between two random variables
cor.test() cor(mat[1,], mat[,2]) #can also plot using: plot(mat[,1], mat[,2]) cor(mat[1,], mat[,2], method="spearman") #Spearman's rank correlation rho cor(mat[1,], mat[,2], method="kendall") #Kendall's rank correlation tau
- The partial correlation can be very useful http://www.yilab.gatech.edu/pcor.html
pcor.test(x, y, z); # calc cor(x,y) given z estimate p.value statistic n gn Method Use 1 -0.1424801 2.843382e-09 -5.940394 1706 1 Pearson Var-Cov matrix
- Fisher's test
fisher.test(matrix( c(28, 16, 4 , 1), nrow=2, ncol=2))
- Estimate sample size requirements for a study:
SampleSizeProportions-package Package documentation propdiff.freq(len, p1.estimate, p2.estimate, level = 0.95) #len= length of the confidence interval desired #p1.estimate= A point estimate for the binomial proportion for the first population #p2.estimate= A point estimate for the binomial proportion for the second population #level= confidence level desired (len=0.95 typically) #Read more:[http://www.ehow.com/how_5212437_calculate-sample-size-proportion.html#ixzz1IaMaQlXH How to calculate sample for Proportion]
- Differential expression using LIMMA
Assume we have 3 WT samples and 3 unpaired mutant samples (labelled dd) and you did RNA-seq on these samples.
GENE Description WT1 WT2 WT3 dd1 dd2 dd3 Akirin1 akirin-1 56.18 61.31 89.48 53.54 63.38 91.83 Akirin2 akirin-2 254.91 243.65 244.12 274.66 233.75 272.08 Akna AT-hook-containing transcription factor 10.29 11.50 16.27 14.04 14.94 11.02 Aknad1 AKNA domain containing 1 0.00 0.00 0.00 0.00 0.00 0.00 Akp3 intestinal-type alkaline phosphatase precursor 0.00 0.00 0.00 0.00 0.00 0.00 Akr1a4 alcohol dehydrogenase [NADP+] 1197.67 1354.26 1264.58 1260.24 1361.56 1393.50 ...
How do we determine which genes are differentially expressed using LIMMA ?
1. create a file called design.txt that will contaim the LIMMA design matrix (tab separated columns).
S W WTvsdd WT1 1 1 WT2 1 1 WT3 1 1 dd1 1 0 dd2 1 0 dd3 1 0
2. use LIMMA
library(limma)
design <- read.table("design.txt", header=T, row.names=1)
esettmp <- read.csv("Table_genes.txt", header=T, row.names=1, check.names=F, sep="\t")
eset <- esettmp[,-1]
varfilter <- apply(eset, 1, sd)>0.5
eset <- eset[varfilter,]
esettmp <- esettmp[varfilter,]
leset <- log2(eset+1)
fit <- lmFit(leset, design)
fit <- eBayes(fit)
oo <- topTable(fit, coef="WTvsdd", adjust="BH", number=40000)
write.table(cbind(esettmp[oo[,1],], oo), file="Table_genes_LIMMA_log2.txt", sep="\t", quote=FALSE, col.names=NA)
Principal Component Analysis
- Principal component analysis (PCA) and biplot
p <- princomp(m, cor=T) biplot(p, pc.biplot=T, cex=0.7, expand=3)
- The loadings (arrows) don't always need to be plotted, in which case, one has to revert to a monoplot
p<-prcomp(t(sm), scale=T, cor=T) lam <- sqrt(NROW(p$x)) * p$sdev[1:2] plot(p$x[,1:2]/lam, pch=".", col="white", xlim=c(-0.75,0.75), ylim=c(-0.75, 0.75)) text(p$x[,1:2]/lam, row.names(p$x), pos=NULL)
- More info: the prcomp function returns an object with 3 slots
p$sdev # contains the square root of the eigenvalues p$rotations # contains the eigenvectors p$x # contains the projected data
- PCA using ADE4 module
https://stat.ethz.ch/pipermail/bioconductor/2003-November/003132.html
- 3D PCA using rgl
library(rgl) pca <- princomp(mmir, cor=T) plot3d(pca$scores [,1:3],xlab="Component 1",main="My 3D PCA",ylab="Component 2", zlab="Component 3", box=F, axes=T) spheres3d(pca$scores, radius=0.5, col="red") text3d(pca$scores[,1:3], text=rownames(pca$scores), adj=1.3, cex=0.8) # adj controls how close the text is to the points, cex controls text size
Multi-dimensional Scaling
- multi-dimensional scaling (MDS)
library(MASS) im <- isoMDS(as.dist(1-cor(t(m))), maxit=1000) eqscplot(im$points, tol=0.3, pch="+") text(im$points, row.names(t(m)))
- Sammon's mapping is another type of MDS:
im <- sammon(as.dist(1-cor(t(m))))
Regression and cross validation
- logistic regression http://www.ats.ucla.edu/stat/r/dae/logit.htm
fit <- glm(z ~ x + y, family=binomial(link="logit")) predict(fit, data.frame(x=mean(x), y=1), type='response')
more info http://www.stat.nthu.edu.tw/~swcheng/Teaching/stat5230/lab/02_BinomialData_EstimationProblem.html
- To evaluate the contribution of individual predictors:
drop1(fit, test="Chi")
- LASSO regression
library(lars) # fits TP4.lars <- lars( m, TP4, type="lasso") # determine s by cross-val TP4.cv <- cv.lars( m, TP4, type="lasso") TP4.s <- TP4.cv$fraction[order(TP4.cv$cv)[1]] # evaluate fit using same data cor.test ( predict(TP4.lars, m, type="fit", mode="fraction", s=TP4.s)$fit, TP4) # print coefficients coef(TP4.lars, mode="fraction", s=TP4.s)
- Elastic Net regression (e=response, m=predictors)
library(lars) library(elasticnet) lambda <- 0.5 e.cv <- cv.enet( m, e, lambda, s=seq(0,1,length=100), mode="fraction", trace=F, K=10) s <- e.cv$s[order(e.cv$cv)[1]] e.fits <- enet( mnona, enona, lambda=lambda) # regression coefficients as.matrix(predict(e.fits, s=s, type="coefficients", mode="fraction")$coefficients) # predict and evaluate fit cor.test (predict(e.fits, m, type="fit", mode="fraction", s=s)$fit, e)
- faster LASSO/ElasticNet regression using the very fast glmnet library
library(glmnet) e.cv <- cv.glmnet( m, e, nfolds=10) e.l <- e.cv$lambda.min alpha <- 0.5 # 1.0 = LASSO, use only >=0.01 # fit e.fits <- glmnet( m, e, family="gaussian", alpha=alpha, nlambda=100) # predict and compare to training data cor.test( predict(e.fits, m, type="response", s=e.l), e)$estimate # get coefficients predict(e.fits, s=e.l, type="coefficients")
- low-level lm()
lm.fit(x, y)$coefficients
- cross-validation
all.folds <- cv.folds(100,10) # generate cross-val partition all.folds[[i]] # get i-th partition
Mixture model fitting
library(mclust) mm <- Mclust(v, G=2) k <- mm$parameters$pro a <- mm$parameters$mean s <- sqrt( mm$parameters$variance$sigmasq ) plot(density(v,bw=0.2), ylim=c(0,0.5), xlim=c(-5,5), xlab="Peak intensity", ylab="Frequency", main="",lwd=4) par(new=T) plot(function(x) k[1]*dnorm(x, mean=a[1], sd=s[1]) + k[2]*dnorm(x, mean=a[2], sd=s[2]), -5, 5, ylim=c(0,0.5), col="red", xlab="", ylab="", lwd=4)
SVM
library(e1071) summary(svm(as.factor(GROUP) ~ mir328 + mir222 + mir21 + mir197 , data=dm, cross=27, cost=100, gamma=0.25, kernel="linear")) svm(as.factor(GROUP) ~ mir328 + mir222 + mir21 + mir197 , data=dm, cost=10, gamma=0.25, kernel="linear") table(true=m[,1], predicted=predict(osvm4, mmir))
Commonly used machine learning packages in R http://cran.r-project.org/web/views/MachineLearning.html The tune.svm function is very useful to explore SVM parameters http://rss.acs.unt.edu/Rdoc/library/e1071/html/tune.control.html
Non-negative matrix factorization
(http://nmf.r-forge.r-project.org/)
Load matrix
LY1 <- read.table("MatPromBind.LY1.txt", header = TRUE, row.names=1)
LY1.mat <- data.matrix(LY1)
Estimates cophenetic and other metrics
estim.LY1 <- nmfEstimateRank(LY1.mat, range=2:6, nrun=40) plot(estim.LY1)
Run nmf with seed="random" and method="brunet". Also available seed=none/ica/nndsvd, method=lee/offset/nsnmf/penmf/als
res <- nmf(LY1.mat, 4) metaHeatmap(res)
When seed="random", run nmf multiple times (nrun) to create the consensus matrix.
res.multirun <- nmf(LY1.mat, 4, nrun=100) metaHeatmap(res.multirun)
Get the matrix of the basis components (similar to principal components in PCA)
res.basis <- basis(res.basis)
Extract the features that are the most specific to each basis vector
res.feat <- extractFeatures(res, method="max") n <- lapply(res.feat, function(i) rownames(res)[i])
Microarray analysis
- load CEL files in R and calculate calls, expression summary
Data<-ReadAffy(filenames=c("test.CEL"))
pData(Data)
Calls<-mas5calls(Data)
calls<-Calls@exprs
# below follows Halfon's Genome Biology' paper
data_exp<-expresso(Data, bgcorrect.method="mas", normalize.method="loess", pmcorrect.method="mas", summary.method="medianpolish", summary.param=list(na.rm=TRUE))
exp=Data_exp@exprs
exp_exp<-exp(exp) # only if you want the raw values and not the logs
- Simple quantile normalization
m <- read.csv("datamatrix.txt", sep="\t", row.names=1, header=T, check.names = FALSE)
m <- log2(m+1) # optional .. do not use if data already log-transformed
library(limma)
mn <- normalizeQuantiles(m)
- make color map for Olivier's Heatmap drawing module
ramp <- colorRamp(c("yellow", "black", rgb(64/255,0/255,128/255) ))
cm <- ramp(seq(0, 1, length = 100))/255
write.table(cm, file="cmap-methylation.txt", sep="\t", quote=F, row.names=F, col.names=F)
- speed
Reading large tables into R
1. Read in a few lines to get the table column classes
2. Get number of rows in the table using "wc" system command
3. Read in the entire table with nrows and column classes specified
E.g.
fname= "datatable.txt"
tab5rows <- read.table(fname, header = TRUE, nrows = 5)
classes <- sapply(tab5rows, class)
nlines= system(paste("wc -l ", fname), intern= TRUE)
tabAll <- read.table(fname, header = TRUE, colClasses = classes, nrows= nlines)
more details: http://www.biostat.jhsph.edu/~rpeng/docs/R-large-tables.html
Another way (reads double-filled matrix with header and row names):
filename <- "NCI-60-HG133Plus-RMA.txt.t.sync"
cntf <- count.fields(filename, sep="\t")
numcols <- cntf[1]
colcls <- c(list(""), as.list(0.1*1:(numcols-1)))
mm <- scan(file=filename, sep="\t", skip=1, what=colcls)
mmc <- mm[2:numcols]
co <- do.call(cbind, mmc)
rownames(co) <- mm[[1]]
h <- scan(file=filename, nmax=numcols, what="", sep="\t")
h <- h[-1]
colnames(co) <- h
Survival analysis
Survival analysis using penalized Cox regression (assumes gene expression is in m2, m1 contains time and status)
library(glmnet) library(survival) a <- 0.1 cv.fit <- cv.glmnet(m2, Surv(m1$TIME, m1$DEATH), family="cox", maxit=1000, alpha=a) fit <- glmnet(m2, Surv(m1$TIME, m1$DEATH), family="cox", maxit=1000, alpha=a) Coefficients<-coef(fit,s=cv.fit$lambda.min) Active.Index<-which(Coefficients!=0) Active.Coefficients<-Coefficients[Active.Index] colnames(m2)[Active.Index]
Univariate Cox analysis of non-zero coefficients
for (f in colnames(m2)[Active.Index]) { print(f); print(summary(coxph(Surv(m1$TIME, m1$DEATH) ~ m2[,f]))$coefficients) }
Simple KM plot
plot(survfit(Surv(m1$TIME, m1$DEATH) ~ 1))
Make an R package vignette
- Write a Sweave file (latex and S)
- Run in R
Sweave("IRanges_rtracklayer.Snw")
This will produce the file IRanges_rtracklayer.tex, that can be easily converted to pdf
R CMD pdflatex IRanges_rtracklayer.tex
- To produce the R chunks run in R:
Stangle("IRanges_rtracklayer.Snw")
This will produce the file IRanges_rtracklayer.R, that can contains all commands in R
- R Sweave does not like undescores "_" in the name of the Sweave file.
Especially if you produce plots within the .Snw file (using plot, boxplot, etc), the .tex file produced will not understand the underscore and will fail to use the PDF or EPS figure's path.
Latex will then complain that it cannot find the figures.
This is caused by the underscore "_" that the .Snw filename has.
R tutorials
- BioConductor tutorial http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual
- R for deep sequencing ..... an excellent tutorial http://manuals.bioinformatics.ucr.edu/home/ht-seq
- Online tutorial
Heatmaps: http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/r/heatmap/
A blog about R: http://statisticsr.blogspot.com/
Computational Statistics course: http://stat.ethz.ch/teaching/lectures/SS_2006/CompStat
- other useful packages
http://bioinf.wehi.edu.au/limma/
- R and GSEA
http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/R-GSEA_Readme
