STAMPS: DADA and Mixtures

Susan Holmes

August 1, 2018

In a run we expect different numbers of reads for each true strain

Even if the actual relative abundances were equal: - Techological replicates have Poisson noise.

Different strains have different abundances

The distribution of strains and species is very variable.
- Different bacteria are present in different abundances within a sample.
- Different samples will have different sample depths.
- Different samples will have different bacterial compositions.

set.seed(1945)
##Suppose a small error rate for a sequence
techn=rpois(200,0.005)
table(techn)
## techn
##   0   1 
## 198   2
mean(techn)
## [1] 0.01
# About 0.01 per read
run=rpois(1e5,0.01)
table(run)
## run
##     0     1     2 
## 98990  1005     5
sim1K=replicate(1000,sum(rpois(1e5,0.01)))
mean(sim1K)
## [1] 999.317

Runs of about 100,000 reads can expect 1,000 read-errors

hist(sim1K,50,col="lavender")

What is the consequence of having different numbers of reads?

We’ll need a little auxiliary package:

install.packages("mixtools")
library("mixtools")
## mixtools package, version 1.1.0, Released 2017-03-10
## This package is based upon work supported by the National Science Foundation under Grant No. SES-0518772.
S1=1e3
S2=1e5
seq1=rmvnorm(n=S1,mu=c(-1,-1),sigma=diag(c(0.5,0.5)))
seq2=rmvnorm(n=S2,mu=c(1,1),sigma=diag(c(0.5,0.5)))
twogr=data.frame(rbind(seq1,seq2),gr=factor(c(rep(1,S1),rep(2,S2))))
colnames(twogr)[1:2]=c("x","y")
plot(twogr[,1:2],col=twogr$gr)

plot(twogr[,1:2],pch='.',col=twogr$gr)

What do we notice?

(a nice picture if you want to use ggplot2)

library(ggplot2)
v <- ggplot(twogr, aes(x=x, y=y,fill=gr,colour=gr))
v + geom_point(alpha=0.3) + coord_fixed()

A better plot for dense areas:

v + geom_hex(alpha=0.5, bins= 50) + coord_fixed()

v + geom_hex(alpha=0.4, bins= 50)+
annotate("path",
   x=1+1.3*cos(seq(0,2*pi,length.out=100)),
   y=1+1.3*sin(seq(0,2*pi,length.out=100)))+
annotate("path",
   x=-1+1.3*cos(seq(0,2*pi,length.out=100)),
   y=-1+1.3*sin(seq(0,2*pi,length.out=100)))+
annotate("text", x = -1, y = -1, label = "OTU  1")+   
annotate("text", x = 1, y = 1, label = "OTU  2")+    
coord_fixed()

The radius of attraction for the two sequences is different.

The taxicab problem

Amos Tversky and Danny Kahnneman

Amos Tversky and Danny Kahnneman

Kanneman and Tversky: Heuristics and biases.

See Thinking Fast and Slow, D. Kannemean

Representativeness heuristic and base rate neglect.

Model using the frequencies in your data

Probability that it is a new sequence given the other information

We can use the data to look at error rates as we go along:

EM: expectation-maximization method.

Alternating method: pretend we know the errors.

Probability Model

s: ATTAACGAGATTATAACCAGAGTACGAATA...
     |              |
r: ATCAACGAGATTATAACAAGAGTACGAATA...

\[P(r|s)=\prod_{i=1}^L P(r(i)|s(i),q_r(i),Z)\] \(P\) probabilities of substitutions (\(A -> C\))

We estimate the (\(A -> C\)) error transitions from the data and these are output by the program and plotted.

\(q\) Quality score (Q=30)

Batch effect (run)

Tutorials

Take away the .txt from the file name to obtain just the R file (you’ll need to download the data from the stamps server first):

Your turn:

DADA_Ex.R.txt

Full stack tutorial run through:

Ben’s full stack tutorial

The most beautiful workflow ever!

Mike Lee’s workflow

Benchmarking and Quality Control

State of the Art (bjc)

Complete worflow BioC workshop

Statistical background explaining EM, model etc

Book link:mixtures

Book link:clustering