Lab3 (RNW)
%
% NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is
% likely to be overwritten.
%
% \VignetteIndexEntry{Seattle Lab 3}
%\VignetteDepends{Biobase. golubEsets, ellipse,lattice, mva, MASS }
%\VignetteKeywords{Microarray}
\documentclass[12pt]{article}
\usepackage{amsmath,pstricks}
\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}
\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in
\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}
\bibliographystyle{plainnat}
\begin{document}
\section*{Clustering Using R and Bioconductor}
We continue our extended example involving the data set from \citet{Golub99}.
In this lab we will consider
computing a number of different distances and
clustering techniques applied to these data.
We will look at distance calculations, followed by multidimensional scaling
and then examine a number of different clustering methods.
<<loadlibs>>=
library(Seattle)
library(lattice)
@
We will set up the data from scratch once again.
<<setup>>=
X<-exprs(golubTrain)
X[X<100]<-100
X[X>16000]<-16000
mmfilt <- function(r=5, d=500, na.rm=TRUE) {
function(x) {
minval <- min(x, na.rm=na.rm)
maxval <- max(x, na.rm=na.rm)
(maxval/minval > r) && (maxval-minval > d)
}
}
mmfun <- mmfilt()
ffun <- filterfun(mmfun)
sub <- genefilter(X, ffun )
sum(sub) ## Should get 3051
X <- X[sub,]
X <- log2(X)
golubTrainSub<-golubTrain[sub,]
golubTrainSub@exprs <- X
Y <- golubTrainSub$ALL.AML
Y<-paste(golubTrain$ALL.AML,golubTrain$T.B.cell)
Y<-sub("NA","",Y)
@
Now that the data are set up we are ready to start looking at the data.
\section*{Distances}
%%FIXME: get the ellipse package
We first compute the correlation distance between all the genes in
the selected subset from the training set.
%%FIXME: other distances?
<<cordist>>=
r<-cor(X)
dimnames(r)<-list(as.vector(Y),as.vector(Y))
d<-1-r
@
We rely on the \textit{ellipse} package for plotting some of these
values.
<<plotcorr>>=
plotcorr(r,
main="Leukemia data: Correlation matrix for 38 mRNA samples\n All 3,051 genes")
plotcorr(r,numbers=TRUE,
main="Leukemia data: Correlation matrix for 38 mRNA samples\n All 3,051 genes")
levelplot(r,col.region=heat.colors(50),
main="Leukemia data: Correlation matrix for 38 mRNA samples\n All 3,051 genes")
@
We can next look at multidimensional scaling using these data.
Multidimensional scaling is a data reduction method that is appropriate for
distance data. It is much like principal components (but it is not the
same -- except for Euclidean distances).
An interesting question is whether the data in the distance matrix can be
reduced to a smaller dimensional space.
Note, if we are measuring distances between samples on the basis of
$N=3051$ probes then we are essentially looking at points in 3,051 dimensional
space.
<<mds>>=
library(mva)
mds<- cmdscale(d, k=2, eig=TRUE)
plot(mds$points, type="n", xlab="", ylab="", main="MDS for ALL AML data, correlation matrix, G=3,051 genes, k=2")
text(mds$points[,1],mds$points[,2],Y, col=codes(factor(Y))+1, cex=0.8)
mds<- cmdscale(d, k=3, eig=TRUE)
pairs(mds$points, main="MDS for ALL AML data, correlation matrix, G=3,051 genes, k=3", pch=c("B","T","M")[codes(factor(Y))], col = codes(factor(Y))+1)
@
To assess how many components there are a \textit{scree} plot similar to that
used for principal components can be created.
This plot of the eigenvalues suggests that much of the information is
contained in the first component. One might consider using either three
or four components as well.
<<scree>>=
mdsScree <- cmdscale(d, k=8, eig=TRUE)
plot(mdsScree$eig, pch=18, col="blue")
@
\end{document}
% NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is
% likely to be overwritten.
%
% \VignetteIndexEntry{Seattle Lab 3}
%\VignetteDepends{Biobase. golubEsets, ellipse,lattice, mva, MASS }
%\VignetteKeywords{Microarray}
\documentclass[12pt]{article}
\usepackage{amsmath,pstricks}
\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}
\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in
\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}
\bibliographystyle{plainnat}
\begin{document}
\section*{Clustering Using R and Bioconductor}
We continue our extended example involving the data set from \citet{Golub99}.
In this lab we will consider
computing a number of different distances and
clustering techniques applied to these data.
We will look at distance calculations, followed by multidimensional scaling
and then examine a number of different clustering methods.
<<loadlibs>>=
library(Seattle)
library(lattice)
@
We will set up the data from scratch once again.
<<setup>>=
X<-exprs(golubTrain)
X[X<100]<-100
X[X>16000]<-16000
mmfilt <- function(r=5, d=500, na.rm=TRUE) {
function(x) {
minval <- min(x, na.rm=na.rm)
maxval <- max(x, na.rm=na.rm)
(maxval/minval > r) && (maxval-minval > d)
}
}
mmfun <- mmfilt()
ffun <- filterfun(mmfun)
sub <- genefilter(X, ffun )
sum(sub) ## Should get 3051
X <- X[sub,]
X <- log2(X)
golubTrainSub<-golubTrain[sub,]
golubTrainSub@exprs <- X
Y <- golubTrainSub$ALL.AML
Y<-paste(golubTrain$ALL.AML,golubTrain$T.B.cell)
Y<-sub("NA","",Y)
@
Now that the data are set up we are ready to start looking at the data.
\section*{Distances}
%%FIXME: get the ellipse package
We first compute the correlation distance between all the genes in
the selected subset from the training set.
%%FIXME: other distances?
<<cordist>>=
r<-cor(X)
dimnames(r)<-list(as.vector(Y),as.vector(Y))
d<-1-r
@
We rely on the \textit{ellipse} package for plotting some of these
values.
<<plotcorr>>=
plotcorr(r,
main="Leukemia data: Correlation matrix for 38 mRNA samples\n All 3,051 genes")
plotcorr(r,numbers=TRUE,
main="Leukemia data: Correlation matrix for 38 mRNA samples\n All 3,051 genes")
levelplot(r,col.region=heat.colors(50),
main="Leukemia data: Correlation matrix for 38 mRNA samples\n All 3,051 genes")
@
We can next look at multidimensional scaling using these data.
Multidimensional scaling is a data reduction method that is appropriate for
distance data. It is much like principal components (but it is not the
same -- except for Euclidean distances).
An interesting question is whether the data in the distance matrix can be
reduced to a smaller dimensional space.
Note, if we are measuring distances between samples on the basis of
$N=3051$ probes then we are essentially looking at points in 3,051 dimensional
space.
<<mds>>=
library(mva)
mds<- cmdscale(d, k=2, eig=TRUE)
plot(mds$points, type="n", xlab="", ylab="", main="MDS for ALL AML data, correlation matrix, G=3,051 genes, k=2")
text(mds$points[,1],mds$points[,2],Y, col=codes(factor(Y))+1, cex=0.8)
mds<- cmdscale(d, k=3, eig=TRUE)
pairs(mds$points, main="MDS for ALL AML data, correlation matrix, G=3,051 genes, k=3", pch=c("B","T","M")[codes(factor(Y))], col = codes(factor(Y))+1)
@
To assess how many components there are a \textit{scree} plot similar to that
used for principal components can be created.
This plot of the eigenvalues suggests that much of the information is
contained in the first component. One might consider using either three
or four components as well.
<<scree>>=
mdsScree <- cmdscale(d, k=8, eig=TRUE)
plot(mdsScree$eig, pch=18, col="blue")
@
\end{document}