Susan Holmes
August 1, 2018
Even if the actual relative abundances were equal: - Techological replicates have Poisson noise.
The distribution of strains and species is very variable.
- Different bacteria are present in different abundances within a sample.
- Different samples will have different sample depths.
- Different samples will have different bacterial compositions.
set.seed(1945)
##Suppose a small error rate for a sequence
techn=rpois(200,0.005)
table(techn)
## techn
## 0 1
## 198 2
mean(techn)
## [1] 0.01
# About 0.01 per read
run=rpois(1e5,0.01)
table(run)
## run
## 0 1 2
## 98990 1005 5
sim1K=replicate(1000,sum(rpois(1e5,0.01)))
mean(sim1K)
## [1] 999.317
hist(sim1K,50,col="lavender")
We’ll need a little auxiliary package:
install.packages("mixtools")
library("mixtools")
## mixtools package, version 1.1.0, Released 2017-03-10
## This package is based upon work supported by the National Science Foundation under Grant No. SES-0518772.
S1=1e3
S2=1e5
seq1=rmvnorm(n=S1,mu=c(-1,-1),sigma=diag(c(0.5,0.5)))
seq2=rmvnorm(n=S2,mu=c(1,1),sigma=diag(c(0.5,0.5)))
twogr=data.frame(rbind(seq1,seq2),gr=factor(c(rep(1,S1),rep(2,S2))))
colnames(twogr)[1:2]=c("x","y")
plot(twogr[,1:2],col=twogr$gr)
plot(twogr[,1:2],pch='.',col=twogr$gr)
(a nice picture if you want to use ggplot2)
library(ggplot2)
v <- ggplot(twogr, aes(x=x, y=y,fill=gr,colour=gr))
v + geom_point(alpha=0.3) + coord_fixed()
v + geom_hex(alpha=0.5, bins= 50) + coord_fixed()
v + geom_hex(alpha=0.4, bins= 50)+
annotate("path",
x=1+1.3*cos(seq(0,2*pi,length.out=100)),
y=1+1.3*sin(seq(0,2*pi,length.out=100)))+
annotate("path",
x=-1+1.3*cos(seq(0,2*pi,length.out=100)),
y=-1+1.3*sin(seq(0,2*pi,length.out=100)))+
annotate("text", x = -1, y = -1, label = "OTU 1")+
annotate("text", x = 1, y = 1, label = "OTU 2")+
coord_fixed()
Kanneman and Tversky: Heuristics and biases.
See Thinking Fast and Slow, D. Kannemean
Representativeness heuristic and base rate neglect.
We can use the data to look at error rates as we go along:
EM: expectation-maximization method.
Alternating method: pretend we know the errors.
Assign reads.
Use new reads to compute the error rates….etc..
s: ATTAACGAGATTATAACCAGAGTACGAATA...
| |
r: ATCAACGAGATTATAACAAGAGTACGAATA...
\[P(r|s)=\prod_{i=1}^L P(r(i)|s(i),q_r(i),Z)\] \(P\) probabilities of substitutions (\(A -> C\))
We estimate the (\(A -> C\)) error transitions from the data and these are output by the program and plotted.
\(q\) Quality score (Q=30)
Batch effect (run)
Take away the .txt from the file name to obtain just the R file (you’ll need to download the data from the stamps server first):