Susan Holmes
August 2, 2018
(c) Susan Holmes
It is current practice to measure many variables on the same patients. We may have all the biometric characteristics (height, weight, BMI, age) as well as clinical variables such as blood pressure, blood sugar and heart rate for 100 patients; these variables will not be independent.
To start off, a useful toy example we’ll use is from the sports world; performances of decathlon athletes.
These are measurements of athletes’ performances in the decathlon: the variables m100, m400 and m1500 are performance times in seconds for the 100, 400 and 1500 meter races, m110 is the time taken to finish the 110 meter hurdles, pole is the pole-vault height, and weight is the distance in meters the athletes threw the weight (shot put).
|   | m100  | long | weight | highj | m400  | m110  | disc  | pole | jave  | m1500  |
|---|-------|------|--------|-------|-------|-------|-------|------|-------|--------|
| 1 | 11.25 | 7.43 | 15.48  | 2.27  | 48.90 | 15.13 | 49.28 | 4.70 | 61.32 | 268.95 |
| 2 | 10.87 | 7.45 | 14.97  | 1.97  | 47.71 | 14.46 | 44.36 | 5.10 | 61.76 | 273.02 |
| 3 | 11.18 | 7.44 | 14.20  | 1.97  | 48.29 | 14.81 | 43.66 | 5.20 | 64.16 | 263.20 |
| 4 | 10.62 | 7.38 | 15.02  | 2.03  | 49.06 | 14.72 | 44.80 | 4.90 | 64.04 | 285.11 |
diabetes=read.table(url("http://bios221.stanford.edu/data/diabetes.txt"),header=TRUE,row.names=1)
diabetes[1:4,]
## relwt glufast glutest steady insulin Group
## 1 0.81 80 356 124 55 3
## 3 0.94 105 319 143 105 3
## 5 1.00 90 323 240 143 3
## 7 0.91 100 350 221 119 3
Operational Taxonomic Unit (OTU) read counts in a microbial ecology study; the columns represent different ‘species’ of bacteria and the rows are labeled by sample.
469478 208196 378462 265971 570812
EKCM1.489478 0 0 2 0 0
EKCM7.489464 0 0 2 0 2
EKBM2.489466 0 0 12 0 0
PTCM3.489508 0 0 14 0 0
EKCF2.489571 0 0 4 0 0
Here are some RNA-seq transcriptomic data showing the numbers of mRNA reads for different samples; the rows are the samples and the columns are the genes.
FBgn0000017 FBgn0000018 FBgn0000022 FBgn0000024 FBgn0000028 FBgn0000032
untreated1 4664 583 0 10 0 1446
untreated2 8714 761 1 11 1 1713
untreated4 3150 310 0 3 0 672
treated1 6205 722 0 10 0 1698
treated3 3334 308 0 5 1 757
Mass spectrometry data where the samples carry informative labels (knockout versus wildtype mice) and the columns are protein features designated by their m/z values.
mz 129.9816 72.08144 151.6255 142.0349 169.0413 186.0355
KOGCHUM1 60515 181495 0 196526 25500 51504.40
WTGCHUM1 252579 54697 412 487800 48775 130491.15
WTGCHUM2 187859 56318 46425 454226 45626 100845.01
#######Melanoma/Tcell Data: Peter Lee, Susan Holmes, PNAS.
load(url("http://bios221.stanford.edu/data/Msig3transp.RData"))
round(Msig3transp,2)[1:5,1:6]
## X3968 X14831 X13492 X5108 X16348 X585
## HEA26_EFFE_1 -2.61 -1.19 -0.06 -0.15 0.52 -0.02
## HEA26_MEM_1 -2.26 -0.47 0.28 0.54 -0.37 0.11
## HEA26_NAI_1 -0.27 0.82 0.81 0.72 -0.90 0.75
## MEL36_EFFE_1 -2.24 -1.08 -0.24 -0.18 0.64 0.01
## MEL36_MEM_1 -2.68 -0.15 0.25 0.95 -0.20 0.17
celltypes=factor(substr(rownames(Msig3transp),7,9))
status=factor(substr(rownames(Msig3transp),1,3))
house=read.table("/Users/susan/Dropbox/CaseStudies/votes.txt")
head(house[,1:5])
## V1 V2 V3 V4 V5
## 1 -0.5 -0.5 0.5 -0.5 0.0
## 2 -0.5 -0.5 0.5 -0.5 0.0
## 3 0.5 0.5 -0.5 0.5 -0.5
## 4 0.5 0.5 -0.5 0.5 -0.5
## 5 0.5 0.5 -0.5 0.5 -0.5
## 6 -0.5 -0.5 0.5 -0.5 0.0
party=scan("/Users/susan/Dropbox/CaseStudies/party.txt")
#table(party)
#require(MSBdata)
turtles=read.table(url("http://bios221.stanford.edu/data/PaintedTurtles.txt"),header=TRUE)
turtles[1:4,]
## sex length width height
## 1 f 98 81 38
## 2 f 103 84 38
## 3 f 103 86 42
## 4 f 105 86 40
require(ade4)
data(lizards)
lizards$traits[1:4,c(1,5,6,7,8)]
## mean.L hatch.m clutch.S age.mat clutch.F
## Sa 69.2 0.572 6.0 13 1.5
## Sh 48.4 0.310 3.2 5 2.0
## Tl 168.4 2.235 16.9 19 1.0
## Mc 66.1 0.441 7.2 11 1.5
It is always beneficial to start a multidimensional analysis by checking the simple one-dimensional and two-dimensional summary statistics; we can visualize these using GGally, a graphics package that builds on ggplot2.
What do we mean by low dimensional?
If we are studying only one variable, just one column of our matrix, we might call it \({\mathbf x}\) or \({\mathbf x}_{\bullet j}\); we call it one dimensional.
A one-dimensional summary is a histogram showing that variable’s distribution; we could also compute its mean \(\bar{x}\) or median, which are zero-dimensional summaries of one-dimensional data.
When considering two variables (\(x\) and \(y\)) measured together on a set of observations, the correlation coefficient measures how the variables co-vary.
This single number summarizes two-dimensional data; its formula involves \(\bar{x}\) and \(\bar{y}\): \[\hat{\rho}=\frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2} \sqrt{\sum_{i=1}^n (y_i-\bar{y})^2}} \label{eq:corrcoeff}\]
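As a quick numerical check (a sketch, assuming the turtles data frame read in above), we can compute \(\hat{\rho}\) directly from this formula and compare it with R's built-in cor function:
# correlation coefficient computed from its definition, compared with cor()
x = turtles$width
y = turtles$height
rhohat = sum((x - mean(x)) * (y - mean(y))) /
  (sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2)))
c(byhand = rhohat, builtin = cor(x, y))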
cor(turtles[,-1])
## length width height
## length 1.000 0.978 0.965
## width 0.978 1.000 0.961
## height 0.965 0.961 1.000
library("GGally")
ggpairs(turtles[,-1],axisLabels = "none")
A pairs plot for the athletes data:
ggpairs(athletes)
library("pheatmap")
pheatmap(cor(athletes), cellwidth = 10, cellheight = 10)
We usually center the cloud of points around the origin; the most common way of doing this is to make new variables whose means are all zero.
More robust centering and scaling can also be done (using the median, for instance), as in the sketch below.
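A small sketch (the helper robustscale is not part of the original code): center with the median and scale with the median absolute deviation.
# robust alternative to scale(): center with the median, scale with the MAD
robustscale = function(x) (x - median(x)) / mad(x)
turtlesRobust = apply(turtles[,-1], 2, robustscale)
apply(turtlesRobust, 2, median)   # all medians are zero after robust centering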
Different variables are measured in different units, and at different scales, and so would be hard to compare in their original form.
library("ggplot2")
library("factoextra")
apply(turtles[,-1],2,sd)
## length width height
## 20.48 12.68 8.39
apply(turtles[,-1],2,mean)
## length width height
## 124.7 95.4 46.3
The usual transformation is to rescale the data so that all variables have a common standard deviation of one, as is implicitly done in the correlation coefficient.
This rescaling is done using the scale function which makes every column have a variance of 1.
turtleMatScale=scale(turtles[,-1])
scaledturtles=data.frame(turtleMatScale,sex=turtles[,1])
apply(scaledturtles[,-4],2,mean)
## length width height
## -1.43e-18 1.94e-17 -2.87e-16
apply(scaledturtles[,-4],2,sd)
## length width height
## 1 1 1
ggplot(scaledturtles,aes(x=width,y=height, group =sex)) +
geom_point(aes(color=sex))
Principal Component Analysis (abbreviated PCA) was invented in 1901 by Karl Pearson as a way to reduce a two-variable scatterplot to a single coordinate.
It was used by statisticians in the 1930s to summarize a battery of psychological tests run on the same subjects (Hotelling, 1933), extracting overall scores that could summarize many variables at once.
Note: principal, not principled.
PCA is an ‘unsupervised learning technique’ because it treats all variables as having the same status.
PCA is a visualization technique that produces maps of both the variables and the observations.
We are going to give you a flavor of what are called multivariate analyses. As a useful first approximation, we formulate many of the methods through linear algebra manipulations.
The raison d’être for multivariate analyses is connections or associations between the different variables.
If the columns of the matrix are unrelated, we should just study each column separately and do standard univariate statistics on them one by one.
Here we show one way of projecting two dimensional data onto a line.
The olympic data come from the ade4 package; they are the performances of decathlon athletes in an Olympic competition.
Scatterplot of two variables showing the projection onto the x coordinate in red.
In general, we lose information about the points when we project down from two dimensions (a plane) to one (a line).
If we do it just by using the original coordinates, for instance the x coordinate as we did above, we lose all the information about the second one.
There are actually many ways of projecting the point cloud onto a line. One is to use what are known as regression lines. Let’s look at these lines and how they are constructed in R:
attach(athletes)
require(ggplot2)
# base scatterplot of the two (standardized) throwing variables, used by the plots below
p <- ggplot(athletes, aes(x = weight, y = disc)) + geom_point()
reg1 <- lm(disc ~ weight, data = athletes)
a <- reg1$coefficients[1] # intercept
b <- reg1$coefficients[2] # slope
pline = p + geom_abline(intercept = a, slope = b, col = "blue")
proj = pline + geom_segment(aes(x = weight, xend = weight, y = disc,
            yend = reg1$fitted), linetype = 1, colour = "red",
            arrow = arrow(length = unit(0.15, "cm")))
print(proj)
The blue line minimizes the sum of squares of the vertical residuals (shown in red).
What is the variance of the points along the blue line?
matproj = cbind(weight, reg1$fitted)
sum(apply(matproj, 2, var))
## [1] 1.65
# the other regression: weight explained by disc (projection along the horizontal direction)
reg2 <- lm(weight ~ disc, data = athletes)
matproj2 = cbind(reg2$fitted, disc)
sum(apply(matproj2, 2, var))
## [1] 1.65
The orange line minimizes the sum of squares of the horizontal residuals, those of the weight variable (shown in orange).
xy=cbind(athletes$disc,athletes$weight)
svda=svd(xy)
pc = xy %*% svda$v[,1] %*% t(svda$v[,1])
bp = svda$v[2,1] /svda$v[1,1]
ap = mean(pc[,2])-bp*mean(pc[,1])
p+geom_segment(xend=pc[,1],yend=pc[,2])+
geom_abline(intercept=ap,slope=bp, col="purple",lwd=1.5)
The purple line minimizes both residuals and thus (through Pythagoras) it minimizes the sum of squared distances from the points to the line.
Minimizing the distances to the line in both directions, the purple line is the principal component line; the other two lines are the regression lines.
The lines created here are sensitive to the choice of units; because we have made the standard deviations equal to one for both variables, the PCA line is the diagonal that cuts exactly in the middle of both regression lines.
The data were centered by subtracting their means thus ensuring that the line passes through the origin \((0,0)\).
Compute the variance of the points on the purple line.
We saved the coordinates of the projected points when we made the plot; they are in the pc matrix:
apply(pc,2,var)
## [1] 0.903 0.903
sum(apply(pc,2,var))
## [1] 1.81
ppdf = data.frame(PC1n=-svda$u[,1]*svda$d[1],PC2n=svda$u[,2]*svda$d[2])
ggplot(ppdf,aes(x=PC1n,y=PC2n))+geom_point()+ ylab("PC2 ")+
geom_hline(yintercept=0,color="purple",lwd=1.5,alpha=0.5) +
geom_point(aes(x=PC1n,y=0),color="red")+ xlab("PC1 ")+
xlim(-3.5, 2.7)+ylim(-2,2)+coord_fixed() +
geom_segment(aes(xend=PC1n,yend=0), color="red")
The line created here is sensitive to the choice of units, and to the center of the cloud.
Note what Pythagoras’ theorem tells us here: if we minimize in both the horizontal and vertical directions, we are in fact minimizing the orthogonal (diagonal) projections onto the line from each point.
To understand what a linear combination really is, we can use an analogy: when making a healthy juice mix, you can follow a recipe.
\[V=2\times \mbox{ Beets }+ 1\times \mbox{Carrots } +\frac{1}{2} \mbox{ Gala}+ \frac{1}{2} \mbox{ GrannySmith} +0.02\times \mbox{ Ginger} +0.25 \mbox{ Lemon }\] This recipe is a linear combination of individual juice types, in our analogy these are replaced by the original variables. The result is a new variable, the coefficients \((2,1,\frac{1}{2},\frac{1}{2},0.02,0.25)\) are called the loadings.
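To make this concrete (a sketch using the athletes data from above and made-up loadings, not loadings produced by PCA), a new variable is obtained by multiplying each chosen column by its loading and summing:
# a 'throwing' score: a linear combination of three variables with made-up loadings
loadings = c(weight = 0.5, disc = 0.3, pole = 0.2)
throwscore = as.matrix(athletes[, names(loadings)]) %*% loadings
head(throwscore)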
A linear combination of variables defines a line in our space, in the same way as we saw lines in the scatterplot plane for two dimensions. As we saw in that case, there are many ways to choose lines onto which we project the data; there is, however, a ‘best’ line for our purposes.
Total variance can be decomposed: the total sum of squares of the distances between the points and any line can be decomposed into the (squared) distances to the line plus the variance along the line.
We saw that the principal component line minimizes the distance to the line, and it therefore also maximizes the variance of the projections along the line.
What is this? Which projection do you think is better?
It is the projection that maximizes the area of the shadow; an equivalent measurement is the sum of squares of the distances between points in the projection. We want to see as much of the variation as possible: that is what PCA does.
PCA is based on the principle of finding the largest axis of inertia/variability and then iterating to find the next best axis which is orthogonal to the previous one and so on.
Eigenvalues of X’X or Singular values of X tell us the rank.
What does rank mean?
| X | 2 | 4 | 8 |
|---|---|---|---|
| 1 |   |   |   |
| 2 |   |   |   |
| 3 |   |   |   |
| 4 |   |   |   |

| X | 2 | 4 | 8 |
|---|---|---|---|
| 1 | 2 |   |   |
| 2 | 4 |   |   |
| 3 | 6 |   |   |
| 4 | 8 |   |   |

| X | 2 | 4 | 8  |
|---|---|---|----|
| 1 | 2 | 4 | 8  |
| 2 | 4 | 8 | 16 |
| 3 | 6 | 12 | 24 |
| 4 | 8 | 16 | 32 |
We say that the matrix \[\begin{pmatrix} 2&4&8\\ 4&8&16\\ 6 &12&24\\ 8&16&32\\ \end{pmatrix} \] is of rank one.
\[\begin{pmatrix} 2&4&8\\ 4&8&16\\ 6 &12&24\\ 8&16&32\\ \end{pmatrix} = u\, v' = u * t(v), \qquad u =\left(\begin{smallmatrix} 1 \\2\\ 3\\ 4 \end{smallmatrix}\right) \mbox{ and } v'=t(v)=(2\; 4\; 8) . \]
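We can check this in R (a small sketch): the outer product of \(u\) and \(v\) reproduces the matrix, and its rank, computed with qr, is one.
# build the rank-one matrix as an outer product and check its rank
u = c(1, 2, 3, 4)
v = c(2, 4, 8)
u %*% t(v)            # the 4 x 3 matrix above
qr(u %*% t(v))$rank   # rank 1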
X=matrix(c(780, 75, 540, 936, 90, 648, 1300, 125, 900,
728, 70, 504),nrow=3)
X
## [,1] [,2] [,3] [,4]
## [1,] 780 936 1300 728
## [2,] 75 90 125 70
## [3,] 540 648 900 504
u1=c(0.8,0.1,0.6)
v1=c(0.4,0.5,0.7,0.4)
sum(u1^2)
## [1] 1.01
sum(v1^2)
## [1] 1.06
s1=2348.2
s1*u1 %*%t(v1)
## [,1] [,2] [,3] [,4]
## [1,] 751.4 939 1315 751.4
## [2,] 93.9 117 164 93.9
## [3,] 563.6 704 986 563.6
X-s1*u1 %*%t(v1)
## [,1] [,2] [,3] [,4]
## [1,] 28.6 -3.28 -15.0 -23.4
## [2,] -18.9 -27.41 -39.4 -23.9
## [3,] -23.6 -56.46 -86.2 -59.6
Matrix \(X\) we would like to decompose.
Areas are proportional to the entries
Looking at different possible margins
Forcing the margins to have norm \(1\)
u1=c(0.8196, 0.0788, 0.5674)
v1=c(0.4053, 0.4863, 0.6754, 0.3782)
s1=2348.2
s1*u1 %*%t(v1)
## [,1] [,2] [,3] [,4]
## [1,] 780 936 1300 728
## [2,] 75 90 125 70
## [3,] 540 648 900 504
Xsub=matrix(c(12.5 , 35.0 , 25.0 , 25,9,14,26,18,16,21,49,
32,18,28,52,36,18,10.5,64.5,36),ncol=4,byrow=T)
Xsub
## [,1] [,2] [,3] [,4]
## [1,] 12.5 35.0 25.0 25
## [2,] 9.0 14.0 26.0 18
## [3,] 16.0 21.0 49.0 32
## [4,] 18.0 28.0 52.0 36
## [5,] 18.0 10.5 64.5 36
USV=svd(Xsub)
USV
## $d
## [1] 1.35e+02 2.81e+01 3.10e-15 1.85e-15
##
## $u
## [,1] [,2] [,3] [,4]
## [1,] -0.344 0.7717 0.5193 -0.114
## [2,] -0.264 0.0713 -0.3086 -0.504
## [3,] -0.475 -0.0415 -0.0386 0.803
## [4,] -0.528 0.1426 -0.6423 -0.103
## [5,] -0.554 -0.6143 0.4702 -0.280
##
## $v
## [,1] [,2] [,3] [,4]
## [1,] -0.250 0.0404 -0.967 0.0244
## [2,] -0.343 0.8798 0.133 0.3010
## [3,] -0.755 -0.4668 0.186 0.4214
## [4,] -0.500 0.0808 0.111 -0.8551
Xsub-(135*USV$u[,1]%*%t(USV$v[,1]))
## [,1] [,2] [,3] [,4]
## [1,] 0.8802 19.05 -10.088 1.7604
## [2,] 0.0849 1.76 -0.921 0.1698
## [3,] -0.0396 -1.01 0.565 -0.0792
## [4,] 0.1698 3.53 -1.842 0.3397
## [5,] -0.6877 -15.15 8.069 -1.3754
Xsub-(135*USV$u[,1]%*%t(USV$v[,1]))-(28.1*USV$u[,2]%*%t(USV$v[,2]))
## [,1] [,2] [,3] [,4]
## [1,] 0.00387 -0.02528 0.0335 0.00774
## [2,] 0.00398 0.00264 0.0140 0.00796
## [3,] 0.00749 0.01192 0.0214 0.01498
## [4,] 0.00796 0.00527 0.0281 0.01592
## [5,] 0.00983 0.03784 0.0123 0.01965
Xsub- USV$d[1]*USV$u[,1]%*%t(USV$v[,1])-USV$d[2]*USV$u[,2]%*%t(USV$v[,2])
## [,1] [,2] [,3] [,4]
## [1,] 7.22e-15 -1.07e-14 8.88e-15 4.88e-15
## [2,] 2.04e-15 -6.00e-15 1.05e-14 3.22e-15
## [3,] 2.87e-15 -9.55e-15 1.55e-15 6.23e-15
## [4,] 4.39e-15 -5.77e-15 1.78e-14 7.05e-15
## [5,] 5.11e-15 -1.78e-15 1.78e-14 1.78e-14
USV=svd(Xsub)
XS1=Xsub-USV$d[1]*(USV$u[,1]%*% t(USV$v[,1]))
XS1
## [,1] [,2] [,3] [,4]
## [1,] 0.8748 19.05 -10.104 1.750
## [2,] 0.0808 1.76 -0.933 0.162
## [3,] -0.0470 -1.02 0.543 -0.094
## [4,] 0.1616 3.52 -1.866 0.323
## [5,] -0.6963 -15.16 8.043 -1.393
XS2=XS1-USV$d[2]*(USV$u[,2]%*% t(USV$v[,2]))
XS2
## [,1] [,2] [,3] [,4]
## [1,] 7.22e-15 -1.07e-14 8.88e-15 4.88e-15
## [2,] 2.04e-15 -6.00e-15 1.05e-14 3.22e-15
## [3,] 2.87e-15 -9.55e-15 1.55e-15 6.23e-15
## [4,] 4.39e-15 -5.77e-15 1.78e-14 7.05e-15
## [5,] 5.11e-15 -1.78e-15 1.78e-14 1.78e-14
require(ade4)
HairColor=HairEyeColor[,,2]
HairColor
## Eye
## Hair Brown Blue Hazel Green
## Black 36 9 5 2
## Brown 66 34 29 14
## Red 16 7 7 7
## Blond 4 64 5 8
chisq.test(HairColor)
## Warning in chisq.test(HairColor): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: HairColor
## X-squared = 100, df = 9, p-value <2e-16
prows=sweep(HairColor,1,apply(HairColor,1,sum),"/")  # row profiles
pcols=sweep(HairColor,2,apply(HairColor,2,sum),"/")  # column profiles
# expected counts under independence: (row sums) x (column sums) / n
Indep = outer(apply(HairColor,1,sum), apply(HairColor,2,sum))/sum(HairColor)
round(Indep)
##       Brown Blue Hazel Green
## Black    20   19     8     5
## Brown    56   52    21    14
## Red      14   13     5     4
## Blond    32   30    12     8
sum((Indep-HairColor)^2/Indep)  # reproduces the chi-squared statistic from chisq.test above
diabetes.svd=svd(scale(diabetes[,-5]))
names(diabetes.svd)
## [1] "d" "u" "v"
diabetes.svd$d
## [1] 20.09 13.38 9.89 5.63 1.70
turtles.svd=svd(scale(turtles[,-1]))
turtles.svd$d
## [1] 11.75 1.42 1.00
\[{\Large X = u_{\bullet 1}\, s_1\, v_{\bullet 1}' \;+\; u_{\bullet 2}\, s_2\, v_{\bullet 2}' \;+\; u_{\bullet 3}\, s_3\, v_{\bullet 3}' \;+\; \cdots}\]
We write our horizontal/vertical decomposition of the matrix \(X\) in shorthand as \[{\Large X = USV', \qquad V'V=I, \qquad U'U=I,}\] with \(S\) the diagonal matrix of singular values (the {\tt d} component returned by the R svd function). The cross product of \(X\) with itself verifies \[{\Large X'X=VSU'USV'=VS^2V'=V\Lambda V'}\] where \(V\) is called the eigenvector matrix of the symmetric matrix \(X'X\) and \(\Lambda\) is the diagonal matrix of eigenvalues of \(X'X\).
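We can verify this numerically (a sketch using the scaled turtle measurements from above): the eigenvalues of \(X'X\) are the squared singular values of \(X\), and the eigenvectors match \(V\) up to the signs of the columns.
# check that eigen(X'X) recovers V and the squared singular values of X
X = scale(turtles[,-1])
sv = svd(X)
ev = eigen(crossprod(X))
sv$d^2
ev$values                      # equal to the squared singular values
abs(sv$v) - abs(ev$vectors)    # zero: same vectors up to column signs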
The singular vectors from the singular value decomposition (the svd function above) give us the coefficients to put in front of the old variables to make our new ones with better properties. We write this as: \[PC_1=c_1 X_{\bullet 1} +c_2 X_{\bullet 2}+ c_3 X_{\bullet 3}+\cdots + c_p X_{\bullet p}\]
Replace \(X_{\bullet 1},X_{\bullet 2}, \ldots X_{\bullet p}\) by \[PC_1, PC_2, \ldots PC_k\]
Suppose we have 5 samples with 23,000 genes measured on them, what is the dimensionality of these data?
The number of principal components is less than or equal to the number of original variables. \[K\leq min(n,p)\]
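As a quick illustration (a sketch with simulated data, not one of the data sets above): with \(n=5\) samples and many more variables, there are at most 5 nonzero singular values, and only 4 after centering the columns.
# with n = 5 samples and p = 1000 variables the rank is at most 5 (4 after centering)
set.seed(1)
mat = matrix(rnorm(5*1000), nrow = 5)
svd(mat)$d                          # at most 5 nonzero singular values
svd(scale(mat, scale = FALSE))$d    # the last one is essentially zero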
The geometry of the data: a useful trick is to look at the sizes (norms) of the vectors.
The Principal Component transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each successive component in turn has the highest variance possible under the constraint that it be orthogonal to the preceding components. \[\max_{aX} \mbox{var}(Proj_{aX} (X))\]
Suppose the matrix of data \(X\) has been made to have column means 0 and standard deviations 1.
We call the principal components the columns of the matrix \(C=US\).
The columns of \(U\) (the matrix given as USV$u in the output from the svd function above) are rescaled so that their squared norm is \(s^2\), the variance for which they are responsible.
If the matrix \(X\) comes from the study of \(n\) different samples or specimens, then the principal components provide new coordinates for these \(n\) points; these are sometimes also called the scores in some of the (many) PCA functions available in R (princomp, prcomp, dudi.pca in ade4).
If we only want the first one then it is just \(c_1=s_1 u_1\).
Variance explained by first principal component: \(s_1^2\):
Notice that \(||c_1||^2 = s_1 u_1' u_1 s_1 = s_1^2\, u_1'u_1 = s_1^2 = \lambda_1\).
\[X'C=VSU'US=VS^2\]
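To make this concrete (a sketch using the scaled turtle data), the principal components \(C=US\) can be computed directly from the svd output and agree, up to the signs of the columns, with the scores returned by prcomp:
# principal components (scores) as C = US, compared with prcomp
X = scale(turtles[,-1])
sv = svd(X)
C = sv$u %*% diag(sv$d)
head(C)
head(prcomp(X)$x)   # same scores, possibly with flipped signs on some columns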
We start with the turtles data that has 3 continuous variables and a gender variable that we leave out for the original PCA analysis.
When computing the variance-covariance matrix, many programs use \(1/(n-1)\) as the denominator; here \(n=48\), so the sum of the variances is off by a small fudge factor of \(48/47\).
turtles3var=turtles[,-1]
apply(turtles3var,2,mean)
## length width height
## 124.7 95.4 46.3
turtles.pca=princomp(turtles3var)
print(turtles.pca)
## Call:
## princomp(x = turtles3var)
##
## Standard deviations:
## Comp.1 Comp.2 Comp.3
## 25.06 2.26 1.94
##
## 3 variables and 48 observations.
(25.06^2+2.26^2+1.94^2)*(48/47)
## [1] 650
apply(turtles3var,2,var)
## length width height
## 419.5 160.7 70.4
apply(turtles[,-1],2,sd)
## length width height
## 20.48 12.68 8.39
turtlesc=scale(turtles[,-1])
cor(turtlesc)
## length width height
## length 1.000 0.978 0.965
## width 0.978 1.000 0.961
## height 0.965 0.961 1.000
pca1=princomp(turtlesc)
pca1
## Call:
## princomp(x = turtlesc)
##
## Standard deviations:
## Comp.1 Comp.2 Comp.3
## 1.695 0.205 0.145
##
## 3 variables and 48 observations.
The screeplot showing the eigenvalues for the standardized data: one very large component in this case and two very small ones, the data are (almost) one dimensional.
pca.turtles=dudi.pca(turtles[,-1],scannf=F,nf=2)
scatter(pca.turtles)
Choose \(k\), the number of components to retain, carefully.
This data set describes 18 lizards, as reported by Bauwens and Díaz-Uriarte (1997). It also gives life-history traits corresponding to these 18 species.
The traits are: mean.L (mean length, mm), matur.L (length at maturity, mm), max.L (maximum length, mm), hatch.L (hatchling length, mm), hatch.m (hatchling mass, g), clutch.S (clutch size), age.mat (age at maturity, in number of months of activity) and clutch.F (clutch frequency).
library(ade4)
data(lizards)
names(lizards)
## [1] "traits" "hprA" "hprB"
lizards$traits[1:4,]
## mean.L matur.L max.L hatch.L hatch.m clutch.S age.mat clutch.F
## Sa 69.2 58 82 27.8 0.572 6.0 13 1.5
## Sh 48.4 42 56 22.9 0.310 3.2 5 2.0
## Tl 168.4 132 190 42.8 2.235 16.9 19 1.0
## Mc 66.1 56 72 25.0 0.441 7.2 11 1.5
It is always a good idea to check the variables one at a time and two at a time to see what the basic statistics of the data are.
tabtraits=lizards$traits
options(digits=2)
colMeans(tabtraits)
## mean.L matur.L max.L hatch.L hatch.m clutch.S age.mat clutch.F
## 71.34 59.39 82.83 26.88 0.56 5.87 10.89 1.56
cor(tabtraits)
## mean.L matur.L max.L hatch.L hatch.m clutch.S age.mat clutch.F
## mean.L 1.00 0.99 0.99 0.89 0.94 0.92 0.77 -0.48
## matur.L 0.99 1.00 0.99 0.90 0.92 0.92 0.79 -0.49
## max.L 0.99 0.99 1.00 0.88 0.92 0.91 0.78 -0.51
## hatch.L 0.89 0.90 0.88 1.00 0.96 0.72 0.58 -0.42
## hatch.m 0.94 0.92 0.92 0.96 1.00 0.78 0.64 -0.45
## clutch.S 0.92 0.92 0.91 0.72 0.78 1.00 0.81 -0.55
## age.mat 0.77 0.79 0.78 0.58 0.64 0.81 1.00 -0.62
## clutch.F -0.48 -0.49 -0.51 -0.42 -0.45 -0.55 -0.62 1.00
require(ade4)
res=dudi.pca(tabtraits,scannf=F,nf=2)
barplot(res$eig)
res
## Duality diagramm
## class: pca dudi
## $call: dudi.pca(df = tabtraits, scannf = F, nf = 2)
##
## $nf: 2 axis-components saved
## $rank: 8
## eigen values: 6.5 0.83 0.42 0.17 0.045 ...
## vector length mode content
## 1 $cw 8 numeric column weights
## 2 $lw 18 numeric row weights
## 3 $eig 8 numeric eigen values
##
## data.frame nrow ncol content
## 1 $tab 18 8 modified array
## 2 $li 18 2 row coordinates
## 3 $l1 18 2 row normed scores
## 4 $co 8 2 column coordinates
## 5 $c1 8 2 column normed scores
## other elements: cent norm
biplot(res)
res$eig/(sum(res$eig))
## [1] 0.81118 0.10387 0.05219 0.02133 0.00563 0.00488 0.00061 0.00031
round(cor(athletes),1)
## m100 long weight highj m400 m110 disc pole javel m1500
## m100 1.0 -0.5 -0.2 -0.1 0.6 0.6 0.0 -0.4 -0.1 0.3
## long -0.5 1.0 0.1 0.3 -0.5 -0.5 0.0 0.3 0.2 -0.4
## weight -0.2 0.1 1.0 0.1 0.1 -0.3 0.8 0.5 0.6 0.3
## highj -0.1 0.3 0.1 1.0 -0.1 -0.3 0.1 0.2 0.1 -0.1
## m400 0.6 -0.5 0.1 -0.1 1.0 0.5 0.1 -0.3 0.1 0.6
## m110 0.6 -0.5 -0.3 -0.3 0.5 1.0 -0.1 -0.5 -0.1 0.1
## disc 0.0 0.0 0.8 0.1 0.1 -0.1 1.0 0.3 0.4 0.4
## pole -0.4 0.3 0.5 0.2 -0.3 -0.5 0.3 1.0 0.3 0.0
## javel -0.1 0.2 0.6 0.1 0.1 -0.1 0.4 0.3 1.0 0.1
## m1500 0.3 -0.4 0.3 -0.1 0.6 0.1 0.4 0.0 0.1 1.0
pca.ath <- dudi.pca(athletes, scan = F)
pca.ath$eig
## [1] 3.42 2.61 0.94 0.88 0.56 0.49 0.43 0.31 0.27 0.10
barplot(pca.ath$eig)
The screeplot is the first thing to look at; it tells us that it is satisfactory to use a two-dimensional plot.
s.corcircle(pca.ath$co,clab=1, grid=FALSE, fullcircle = TRUE,box=FALSE)
The correlation circle made by showing the projection of the old variables onto the two first new principal axes.
# change the sign of the running-event times so that larger values mean better performance
athletes[,c(1,5,6,10)]=-athletes[,c(1,5,6,10)]
round(cor(athletes),1)
## m100 long weight highj m400 m110 disc pole javel m1500
## m100 1.0 0.5 0.2 0.1 0.6 0.6 0.0 0.4 0.1 0.3
## long 0.5 1.0 0.1 0.3 0.5 0.5 0.0 0.3 0.2 0.4
## weight 0.2 0.1 1.0 0.1 -0.1 0.3 0.8 0.5 0.6 -0.3
## highj 0.1 0.3 0.1 1.0 0.1 0.3 0.1 0.2 0.1 0.1
## m400 0.6 0.5 -0.1 0.1 1.0 0.5 -0.1 0.3 -0.1 0.6
## m110 0.6 0.5 0.3 0.3 0.5 1.0 0.1 0.5 0.1 0.1
## disc 0.0 0.0 0.8 0.1 -0.1 0.1 1.0 0.3 0.4 -0.4
## pole 0.4 0.3 0.5 0.2 0.3 0.5 0.3 1.0 0.3 0.0
## javel 0.1 0.2 0.6 0.1 -0.1 0.1 0.4 0.3 1.0 -0.1
## m1500 0.3 0.4 -0.3 0.1 0.6 0.1 -0.4 0.0 -0.1 1.0
pcan.ath=dudi.pca(athletes,nf=2,scannf=F)
pcan.ath$eig
## [1] 3.42 2.61 0.94 0.88 0.56 0.49 0.43 0.31 0.27 0.10
Now the remaining negative correlations are all quite small. Redoing the screeplot shows no change in the eigenvalues; the only thing that changes is the sign of the loadings for the running-event variables.
s.corcircle(pcan.ath$co,clab=1.2,box=FALSE)
Correlation circle after changing the signs of the running variables.
data(olympic)
olympic$score
## [1] 8488 8399 8328 8306 8286 8272 8216 8189 8180 8167 8143 8114 8093 8083 8036
## [16] 8021 7869 7860 7859 7781 7753 7745 7743 7623 7579 7517 7505 7422 7310 7237
## [31] 7231 7016 6907
Scatterplot of the scores given as a supplementary variable and the first principal component, the points are labeled by their order in the data set.
######center and scale the data
###(they have already had variance normalization applied to them)
res.Msig3=dudi.pca(Msig3transp,center=TRUE,scale=TRUE,scannf=F,nf=4)
screeplot(res.Msig3,main="")
celltypes=factor(substr(rownames(Msig3transp),7,9))
table(celltypes)
## celltypes
## EFF MEM NAI
## 10 9 11
status=factor(substr(rownames(Msig3transp),1,3))
require(ggplot2)
gg <- cbind(res.Msig3$li,Cluster=celltypes)
gg <- cbind(sample=rownames(gg),gg)
ggplot(gg, aes(x=Axis1, y=Axis2)) +
  geom_point(aes(color=factor(Cluster)),size=5) +
  geom_hline(yintercept=0,linetype=2) +
  geom_vline(xintercept=0,linetype=2) +
  scale_color_discrete(name="Cluster") +
  coord_fixed() + ylim(-8,+8) + xlim(-14,18)
PCA of gene expression for a subset of 156 genes involved in specificities of each of the three separate T cell types: effector, naive and memory
Example from the paper: Kashyap et al., PNAS, 2013.
### Just for the record, this is how the matrix was made
require(xcms)
load(url("http://bios221.stanford.edu/data/xset3.RData"))
mat1 =groupval(xset3, value="into")
##
head(mat1)
## ko15 ko16 ko18 ko19 ko21 ko22 wt15 wt16
## 200.1/2927 147888 103548 65290 60144 85156 162012 175177 82619
## 205/2791 1778569 1567038 1482796 1039130 1223132 1072038 1950287 1466781
## 206/2791 237994 269714 201393 150107 176990 156797 276542 222366
## 207.1/2719 380873 460630 351750 219288 286849 235023 417170 324892
## 219.1/2524 235545 173623 82365 79480 185792 174459 244584 161184
## 231/2516 0 70796 222609 286232 435094 100076 0 73142
## wt18 wt19 wt21 wt22
## 200.1/2927 51943 69198 153273 98144
## 205/2791 1572679 1275313 1356014 1231442
## 206/2791 211718 186851 188286 172349
## 207.1/2719 277991 220972 252874 236728
## 219.1/2524 72029 75097 238194 173830
## 231/2516 165383 240261 201316 179438
dim(mat1)
## [1] 399 12
## Matrix with samples in rows and variables in columns
tmat= t(mat1)
head(tmat[,1:10])
## 200.1/2927 205/2791 206/2791 207.1/2719 219.1/2524 231/2516 233/3023
## ko15 147888 1778569 237994 380873 235545 0 399145
## ko16 103548 1567038 269714 460630 173623 70796 356951
## ko18 65290 1482796 201393 351750 82365 222609 410551
## ko19 60144 1039130 150107 219288 79480 286232 198417
## ko21 85156 1223132 176990 286849 185792 435094 363382
## ko22 162012 1072038 156797 235023 174459 100076 317806
## 234/3024 236.1/2524 240.2/3681
## ko15 76881 252282 112441
## ko16 99480 206032 153376
## ko18 97428 71764 193769
## ko19 53440 67643 170641
## ko21 88228 186661 88800
## ko22 81072 198804 146563
logtmat=log(tmat+1)
## PCA Example
require(ade4)
require(ggplot2)
load(url("http://bios221.stanford.edu/data/logtmat.RData"))
pca.result=dudi.pca(logtmat, scannf=F,nf=3)
labs=rownames(pca.result$li)
nos=substr(labs,3,4)
type=as.factor(substr(labs,1,2))
kos=which(type=="ko")
wts=which(type=="wt")
pcs=data.frame(Axis1=pca.result$li[,1],Axis2=pca.result$li[,2],labs,type)
pcsplot=ggplot(pcs,aes(x=Axis1,y=Axis2,label=labs,group=nos,colour=type)) +
geom_text(size=4,vjust=-0.5) + geom_point()
pcsplot + geom_hline(yintercept=0,linetype=2) +coord_fixed() + ylim(-12,18) +
geom_vline(xintercept=0,linetype=2)
pcsplot+geom_line(colour="red") + coord_fixed() + ylim(-12,18)
Phylochip data allowed us to discover a batch effect.
Columns of the DataFrame represent different attributes of the features of interest, e.g., gene or transcript IDs, etc.
An example of a hybrid data container from single-cell experiments (see the Bioconductor workflow in Perraudeau, 2017 for more details).
After the pre-processing and normalization steps prescribed in the workflow, we retain the 1,000 most variable genes measured on 747 cells.
require(SummarizedExperiment)
corese = readRDS(path.expand("~/Books/CUBook/data/normse.rds"))
norm = assays(corese)$normalizedValues
We can look at a PCA of the normalized values and check graphically that the batch effect has been removed:
respca = dudi.pca(t(norm), nf = 3, scannf = FALSE)
# plotbar is a custom screeplot helper not defined here; an equivalent standard call:
barplot(respca$eig[1:15])
PCS = respca$li[, 1:3]
Since the screeplot shows us that we must not dissociate axes 2 and 3, we will make a three dimensional plot with the rgl package.
library("rgl")
batch = colData(corese)$Batch
# col_clus (cluster colors) and publishedClusters are defined elsewhere in the workflow (not shown here)
plot3d(PCS,aspect=sqrt(c(84,24,20)),col=col_clus[batch])
plot3d(PCS,aspect=sqrt(c(84,24,20)),
    col = col_clus[as.character(publishedClusters)])
With plotly:
library(plotly)
p <- plot_ly(PCS,x=~Axis1,y=~Axis2,z=~Axis3,color=batch) %>% add_markers()
PCA requires conscious preprocessing, to make the variances comparable and the centers at the origin. It produces new, more informative variables that are linear combinations of the old ones. See all the details in the chapter on multivariate analysis.
require(ggbiplot)
data(wine)
wine[1:3,1:7]
heatmap(1-cor(wine))
wine.pca <- prcomp(wine, scale. = TRUE)
table(wine.class)
fviz_pca_biplot(wine.pca,
    habillage = wine.class, addEllipses = TRUE, circle = TRUE)
We’ll see later when we look at Microbiome data that sometimes, this projection can be problematic.
require(ade4)
res.ath=dudi.pca(athletes,nf=2,scannf=F)
inertia.dudi(res.ath,col.inertia=TRUE)
## Inertia information:
## Call: inertia.dudi(x = res.ath, col.inertia = TRUE)
##
## Decomposition of total inertia:
## inertia cum cum(%)
## Ax1 3.4182 3.418 34.18
## Ax2 2.6064 6.025 60.25
## Ax3 0.9433 6.968 69.68
## Ax4 0.8780 7.846 78.46
## Ax5 0.5566 8.403 84.03
## Ax6 0.4912 8.894 88.94
## Ax7 0.4306 9.324 93.24
## Ax8 0.3068 9.631 96.31
## Ax9 0.2669 9.898 98.98
## Ax10 0.1019 10.000 100.00
##
## Column contributions (%):
## m100 long weight highj m400 m110 disc pole javel m1500
## 10 10 10 10 10 10 10 10 10 10
##
## Column absolute contributions (%):
## Axis1(%) Axis2(%)
## m100 17.296 2.21439
## long 15.528 2.31288
## weight 7.242 23.38084
## highj 4.506 0.07783
## m400 12.663 12.40165
## m110 18.791 0.48397
## disc 3.090 25.33458
## pole 14.752 2.23748
## javel 3.238 13.83520
## m1500 2.895 17.72118
##
## Signed column relative contributions:
## Axis1 Axis2
## m100 -59.121 -5.7716
## long -53.077 -6.0283
## weight -24.754 60.9397
## highj -15.404 0.2029
## m400 -43.284 -32.3236
## m110 -64.231 -1.2614
## disc -10.563 66.0319
## pole -50.426 5.8317
## javel -11.068 36.0600
## m1500 -9.895 -46.1884
##
## Cumulative sum of column relative contributions (%):
## Axis1 Axis1:2 Axis3:10
## m100 59.121 64.89 35.11
## long 53.077 59.11 40.89
## weight 24.754 85.69 14.31
## highj 15.404 15.61 84.39
## m400 43.284 75.61 24.39
## m110 64.231 65.49 34.51
## disc 10.563 76.60 23.40
## pole 50.426 56.26 43.74
## javel 11.068 47.13 52.87
## m1500 9.895 56.08 43.92
Contributions are printed as percentages, and the sign of the signed relative contributions is the sign of the corresponding coordinate.
Most of these are available through the dist function, or the vegdist function in the vegan package:
library(vegan)
help(vegdist)
Examples include the manhattan distance (an option in dist and vegdist) and specialized distances such as dist.dna in the ape package.
# function for computing the simple matching coefficient (SMC)
SMC = function(p,q) {
# Compute F01,F10,F11,F00
F01 = sum((p == 0) & (q == 1))
F10 = sum((p == 1) & (q == 0))
F00 = sum((p == 0) & (q == 0))
F11 = sum((p == 1) & (q == 1))
return((F11+F00)/(F01+F10+F00+F11))
}
# function for computing Jaccard coefficient
JC = function(p,q) {
# Compute F01,F10,F11,F00
F01 = sum((p == 0) & (q == 1))
F10 = sum((p == 1) & (q == 0))
F11 = sum((p == 1) & (q == 1))
return(F11/(F01+F10+F11))
}
cos.sim = function(p,q) {
return(sum(p*q) / sqrt(sum(p^2)*sum(q^2)))
}
d1 = c(3,2,0,5,0,0,0,2,0,0)
d2 = c(1,0,0,0,0,0,0,1,0,2)
print(cos.sim(d1,d2))
## [1] 0.31
p=c(rep(0,6),rep(1,4))
p
## [1] 0 0 0 0 0 0 1 1 1 1
q=c(rep(0,6),1,0,0,1)
q
## [1] 0 0 0 0 0 0 1 0 0 1
print(JC(p,q))
## [1] 0.5
print(SMC(p,q))
## [1] 0.8
require(ade4)
data(olympic)
disto=dist(scale(olympic$tab))
str(disto)
## 'dist' num [1:528] 4.36 4.11 4.18 5.19 4.28 ...
## - attr(*, "Size")= int 33
## - attr(*, "Labels")= chr [1:33] "1" "2" "3" "4" ...
## - attr(*, "Diag")= logi FALSE
## - attr(*, "Upper")= logi FALSE
## - attr(*, "method")= chr "euclidean"
## - attr(*, "call")= language dist(x = scale(olympic$tab))
length(disto)
## [1] 528
as.matrix(disto)[1:5,1:5]
## 1 2 3 4 5
## 1 0.0 4.4 4.1 4.2 5.2
## 2 4.4 0.0 1.9 2.2 2.4
## 3 4.1 1.9 0.0 3.2 2.2
## 4 4.2 2.2 3.2 0.0 4.0
## 5 5.2 2.4 2.2 4.0 0.0
33*32/2
## [1] 528
Suppose we are given a matrix of dissimilarities or distances and we want to build a useful map of the observations.
require(graphics)
data(eurodist)
# look at raw distances
as.matrix(eurodist)[1:7,1:7]
## Athens Barcelona Brussels Calais Cherbourg Cologne Copenhagen
## Athens 0 3313 2963 3175 3339 2762 3276
## Barcelona 3313 0 1318 1326 1294 1498 2218
## Brussels 2963 1318 0 204 583 206 966
## Calais 3175 1326 204 0 460 409 1136
## Cherbourg 3339 1294 583 460 0 785 1545
## Cologne 2762 1498 206 409 785 0 760
## Copenhagen 3276 2218 966 1136 1545 760 0
# graphical representation
heatmap(as.matrix(eurodist))
These were computed as actual distances across land, so we know that we should be able to find a two- or three-dimensional map that represents the data well.
eck=read.table("http://bios221.stanford.edu/data/eckman.txt",header=TRUE)
nc=nrow(eck)
eck[1:9,1:9]
## w434 w445 w465 w472 w490 w504 w537 w555 w584
## 1 0.00 0.86 0.42 0.42 0.18 0.06 0.07 0.04 0.02
## 2 0.86 0.00 0.50 0.44 0.22 0.09 0.07 0.07 0.02
## 3 0.42 0.50 0.00 0.81 0.47 0.17 0.10 0.08 0.02
## 4 0.42 0.44 0.81 0.00 0.54 0.25 0.10 0.09 0.02
## 5 0.18 0.22 0.47 0.54 0.00 0.61 0.31 0.26 0.07
## 6 0.06 0.09 0.17 0.25 0.61 0.00 0.62 0.45 0.14
## 7 0.07 0.07 0.10 0.10 0.31 0.62 0.00 0.73 0.22
## 8 0.04 0.07 0.08 0.09 0.26 0.45 0.73 0.00 0.33
## 9 0.02 0.02 0.02 0.02 0.07 0.14 0.22 0.33 0.00
require(ade4)
queck=quasieuclid(as.dist(1-eck))
eck.pco=dudi.pco(queck,scannf=F,nf=2)
names(eck.pco)
## [1] "eig" "rank" "nf" "cw" "tab" "li" "l1" "c1" "co" "lw"
## [11] "call"
scatter(eck.pco,posi="bottomright")
Other available analysis and plotting options
require(vegan)
require(ggplot2)
ggscree=function(out){
n=length(out)
xs=1:n
df=data.frame(eig=out,xs)
ggplot(data=df, aes(x=xs, y=eig)) +
geom_bar(stat="identity",width=0.3,color="red",fill="orange")
}
res=cmdscale(as.dist(1-eck))
res
## [,1] [,2]
## w434 -0.21 -0.419
## w445 -0.26 -0.411
## w465 -0.41 -0.309
## w472 -0.44 -0.273
## w490 -0.44 0.075
## w504 -0.34 0.373
## w537 -0.24 0.477
## w555 -0.19 0.488
## w584 0.24 0.297
## w600 0.40 0.153
## w610 0.50 -0.029
## w628 0.50 -0.105
## w651 0.46 -0.148
## w674 0.43 -0.171
res=cmdscale(as.dist(1-eck),eig=TRUE)
names(res)
## [1] "points" "eig" "x" "ac" "GOF"
res$eig
## [1] 2.0e+00 1.3e+00 4.4e-01 3.7e-01 1.6e-01 1.0e-01 4.2e-02 3.2e-02
## [9] 1.8e-02 4.2e-03 1.3e-03 2.2e-16 -2.7e-02 -4.7e-02
out=res$eig
col14=rainbow(14)
ggscree(out)
MDS = data.frame(PCo1 = res$points[,1], PCo2 = res$points[,2])
ggplot(data = MDS, aes(PCo1, PCo2)) +
geom_point(size=4,color = col14)
Given a Euclidean distance matrix \(D\) between the observation-rows, with \[d_{i,j} = \sqrt{(x_i^1 - x_j^1)^2 + \dots + (x_i^p - x_j^p)^2},\] call \(B=XX'\). If \(D^{(2)}\) is the matrix of squared distances between the rows of \(X\) in the Euclidean coordinates, we can show that \[-\frac{1}{2} HD^{(2)} H=B, \qquad \mbox{where } H=I-\frac{1}{n}\mathbf{1}\mathbf{1}' \mbox{ is the centering matrix.}\]
We can go backwards from a matrix \(D\) to \(X\) by taking the eigendecomposition of \(B\) in much the same way that PCA provides the best rank \(r\) approximation for data by taking the singular value decomposition of \(X\), or the eigendecomposition of \(XX'\).
\[X^{(r)}=US^{(r)}V'\mbox{ with } S^{(r)}= \mathrm{diag}(s_1, s_2, \ldots, s_r, 0, \ldots, 0).\] This provides the best approximate representation in a Euclidean space of dimension \(r\). The algorithm provides points in a Euclidean space that have approximately the same distances as those provided by \(D\).
In summary, given an \(n \times n\) matrix of interpoint distances \(D\), one can solve for points achieving these distances by: 1. Double centering the interpoint distance squared matrix: \(B = -\frac{1}{2}H D^2 H\). 2. Diagonalizing \(B\): \(B = U \Lambda U^T\). 3. Extracting \(\tilde{X}\): \(\tilde{X} = U \Lambda^{1/2}\).
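Here is a sketch of these three steps coded directly (using the eurodist data loaded above); it recovers the same two-dimensional configuration as cmdscale, up to the signs of the axes:
# classical MDS 'by hand': double centering, eigendecomposition, X = U Lambda^(1/2)
D = as.matrix(eurodist)
n = nrow(D)
H = diag(n) - matrix(1/n, n, n)      # centering matrix
B = -0.5 * H %*% D^2 %*% H           # double-centered squared distances
eB = eigen(B)
Xtilde = eB$vectors[, 1:2] %*% diag(sqrt(eB$values[1:2]))
head(Xtilde)
head(cmdscale(eurodist, k = 2))      # same coordinates up to sign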
res=cmdscale(eurodist,eig=TRUE)
n=length(res$eig)
out=data.frame(k=1:n,eig=res$eig)
ggplot(data=out, aes(x=k, y=eig)) +
geom_bar(stat="identity",width=0.5,color="orange",fill="pink")
MDS = data.frame(PCo1 = res$points[,1], PCo2 = res$points[,2],labs=rownames(res$points))
ggplot(data = MDS, aes(x=PCo1, y=PCo2,label=labs) ) + geom_text()
Notice the orientation. We also do not have `variables’ to interpret.
Nonmetric MDS respects only the order of the distances; it does not try to approximate their actual values.
require(vegan)
require(ggplot2)
res=metaMDS(as.dist(1-eck))
## Run 0 stress 0.023
## Run 1 stress 0.023
## ... Procrustes: rmse 9.8e-07 max resid 1.8e-06
## ... Similar to previous best
## Run 2 stress 0.023
## ... New best solution
## ... Procrustes: rmse 3.1e-06 max resid 4.7e-06
## ... Similar to previous best
## Run 3 stress 0.023
## ... New best solution
## ... Procrustes: rmse 2.1e-06 max resid 3.4e-06
## ... Similar to previous best
## Run 4 stress 0.023
## ... Procrustes: rmse 1.6e-06 max resid 2.7e-06
## ... Similar to previous best
## Run 5 stress 0.023
## ... Procrustes: rmse 2.7e-06 max resid 4.6e-06
## ... Similar to previous best
## Run 6 stress 0.023
## ... Procrustes: rmse 2e-06 max resid 4e-06
## ... Similar to previous best
## Run 7 stress 0.023
## ... Procrustes: rmse 3.1e-06 max resid 4.9e-06
## ... Similar to previous best
## Run 8 stress 0.023
## ... New best solution
## ... Procrustes: rmse 1.5e-06 max resid 2.2e-06
## ... Similar to previous best
## Run 9 stress 0.023
## ... Procrustes: rmse 2.3e-06 max resid 4.1e-06
## ... Similar to previous best
## Run 10 stress 0.023
## ... Procrustes: rmse 3.1e-06 max resid 5e-06
## ... Similar to previous best
## Run 11 stress 0.023
## ... Procrustes: rmse 2.2e-06 max resid 3.9e-06
## ... Similar to previous best
## Run 12 stress 0.023
## ... Procrustes: rmse 1.3e-06 max resid 2.2e-06
## ... Similar to previous best
## Run 13 stress 0.023
## ... New best solution
## ... Procrustes: rmse 1.1e-06 max resid 1.8e-06
## ... Similar to previous best
## Run 14 stress 0.023
## ... Procrustes: rmse 3.7e-06 max resid 6.3e-06
## ... Similar to previous best
## Run 15 stress 0.023
## ... Procrustes: rmse 1.9e-06 max resid 3.4e-06
## ... Similar to previous best
## Run 16 stress 0.023
## ... Procrustes: rmse 5.9e-06 max resid 9.9e-06
## ... Similar to previous best
## Run 17 stress 0.023
## ... Procrustes: rmse 4.6e-06 max resid 7.8e-06
## ... Similar to previous best
## Run 18 stress 0.023
## ... Procrustes: rmse 2.2e-06 max resid 4e-06
## ... Similar to previous best
## Run 19 stress 0.023
## ... New best solution
## ... Procrustes: rmse 5.7e-07 max resid 8.8e-07
## ... Similar to previous best
## Run 20 stress 0.023
## ... Procrustes: rmse 3.6e-06 max resid 6.2e-06
## ... Similar to previous best
## *** Solution reached
col14=rainbow(14)
NMDS = data.frame(PCo1 = res$points[,1], PCo2 = res$points[,2])
ggplot(data = NMDS, aes(PCo1, PCo2)) +
geom_point(size=4,color = col14)
\[{\small \begin{array}{lrrrrrrr} \hline & \mbox{black} & \mbox{blue} & \mbox{green} & \mbox{grey} & \mbox{orange} & \mbox{purple} & \mbox{white} \\ \hline \mbox{quiet} & 27700 & 21500 & 21400 & 8750 & 12200 & 8210 & 25100 \\ \mbox{angry} & 29700 & 15300 & 17400 & 7520 & 10400 & 7100 & 17300 \\ \mbox{clever} & 16500 & 12700 & 13200 & 4950 & 6930 & 4160 & 14200 \\ \mbox{depressed} & 14800 & 9570 & 9830 & 1470 & 3300 & 1020 & 12700 \\ \mbox{happy} & 193000 & 83100 & 87300 & 19200 & 42200 & 26100 & 91500 \\ \mbox{lively} & 18400 & 12500 & 13500 & 6590 & 6210 & 4880 & 14800 \\ \mbox{perplexed} & 1100 & 713 & 801 & 189 & 233 & 152 & 1090 \\ \mbox{virtuous} & 1790 & 802 & 1020 & 200 & 247 & 173 & 1650 \\ \hline \end{array} } \]
(the long version representation)
Indicator variables: \[\begin{array}{lccc} \mbox{Patient} & \mbox{Mutation 1} & \mbox{Mutation 2} & \ldots \\ \mbox{AHY789} & 0 & 0 & \cdots \\ \mbox{AHX717} & 1 & 0 & \cdots \\ \end{array} \] are transformed into a Mutation \(\times\) Mutation matrix.
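For example (a sketch with a tiny made-up indicator matrix; the third mutation column is invented for illustration), the cross product of the patients \(\times\) mutations indicator matrix gives the mutation \(\times\) mutation co-occurrence counts:
# toy patients x mutations indicator matrix
ind = matrix(c(0, 0, 1,
               1, 0, 1), nrow = 2, byrow = TRUE,
             dimnames = list(c("AHY789","AHX717"), c("Mut1","Mut2","Mut3")))
crossprod(ind)   # counts of patients carrying each pair of mutations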
require(ade4)
HairColor=HairEyeColor[,,2]
HairColor
## Eye
## Hair Brown Blue Hazel Green
## Black 36 9 5 2
## Brown 66 34 29 14
## Red 16 7 7 7
## Blond 4 64 5 8
chisq.test(HairColor)
## Warning in chisq.test(HairColor): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: HairColor
## X-squared = 100, df = 9, p-value <2e-16
Conclusion: the data are not independent; the categories show a pattern of dependency. What more can we say about them?
rowsums=as.matrix(apply(HairColor,1,sum))
rowsums
## [,1]
## Black 52
## Brown 143
## Red 37
## Blond 81
colsums=as.matrix(apply(HairColor,2,sum))
colsums
## [,1]
## Brown 122
## Blue 114
## Hazel 46
## Green 31
HCexp=round(rowsums%*%t(colsums)/sum(colsums))
# expected counts under independence, without rounding
Exp = outer(apply(HairColor, 1, sum), apply(HairColor, 2, sum))/sum(HairColor)
# here is actually how the chi-squared statistic is computed
sum((HairColor - Exp)^2/Exp)  # reproduces the statistic reported by chisq.test above
HairColor-HCexp
## Eye
## Hair Brown Blue Hazel Green
## Black 16 -10 -3 -3
## Brown 10 -18 8 0
## Red 2 -6 2 3
## Blond -28 34 -7 0
require(vcd)
mosaic(HairColor,shade=TRUE)
What special property does the HCexp/Exp matrix have?
If we are comparing two categorical variables (hair color and eye color, or color and emotion), the counts in the table are approximately the products of the margins: for an \(I \times J\) contingency table with total sample size \(n=\sum_{i=1}^I \sum_{j=1}^J n_{ij}= n_{\cdot \cdot}\),
\[{\Large n_{ij} \doteq \frac{n_{i\cdot}}{n} \frac{n_{\cdot j}}{n} n,}\] which can also be written \(\mathbf{N} \doteq \mathbf{c}\, \mathbf{r}' n\), where \[\mathbf{c}=\frac{1}{n} \mathbf{N} \mathbf{1}_m \;\mbox{ and }\; \mathbf{r}'=\frac{1}{n} \mathbf{N}' \mathbf{1}_p .\] The departure from independence is measured by the \(\chi^2\) statistic \[{\Large {\cal X}^2=\sum_{i,j} \frac{(n_{ij}-\frac{n_{i\cdot}}{n}\frac{n_{\cdot j}}{n}n)^2} {\frac{n_{i\cdot} n_{\cdot j}}{n^2}n}}\]
Many implementations:
- dudi.coa in ade4
- CCA in vegan
- ordination in phyloseq
# HairColor is a contingency table; convert it to a data frame of counts for dudi.coa
res.coa=dudi.coa(as.data.frame.matrix(HairColor),scannf=FALSE)
names(res.coa)
scatter(res.coa)
s.label(res.coa$c1,boxes=FALSE)
s.label(res.coa$li,add.plot=TRUE)
Paper: Boyu Ren, Sergio Bacallado, Stefano Favaro, Susan Holmes, Lorenzo Trippa, JASA (preprint).
The full talk presents a model for uncertainty.
It uses a Bayesian statistical model.