In a run we expect different numbers of reads for each true strain

Even if the actual relative abundances were equal: - Techological replicates have Poisson noise.

Different strains have different abundances

The distribution of strains and species is very variable.
- Different bacteria are present in different abundances within a sample.
- Different samples will have different sample depths.
- Different samples will have different bacterial compositions.

set.seed(1945)
##Suppose a small error rate for a sequence
techn=rpois(200,0.005)
table(techn)

## techn
##   0   1 
## 198   2

mean(techn)

## [1] 0.01

# About 0.01 per read
run=rpois(1e5,0.01)
table(run)

## run
##     0     1     2 
## 98990  1005     5

sim1K=replicate(1000,sum(rpois(1e5,0.01)))
mean(sim1K)

## [1] 999.317

Runs of about 100,000 reads can expect 1,000 read-errors

hist(sim1K,50,col="lavender")

What is the consequence of having different numbers of reads?

We’ll need a little auxiliary package:

install.packages("mixtools")

library("mixtools")

## mixtools package, version 1.1.0, Released 2017-03-10
## This package is based upon work supported by the National Science Foundation under Grant No. SES-0518772.

S1=1e3
S2=1e5
seq1=rmvnorm(n=S1,mu=c(-1,-1),sigma=diag(c(0.5,0.5)))
seq2=rmvnorm(n=S2,mu=c(1,1),sigma=diag(c(0.5,0.5)))
twogr=data.frame(rbind(seq1,seq2),gr=factor(c(rep(1,S1),rep(2,S2))))
colnames(twogr)[1:2]=c("x","y")
plot(twogr[,1:2],col=twogr$gr)

plot(twogr[,1:2],pch='.',col=twogr$gr)

What do we notice?

(a nice picture if you want to use ggplot2)

library(ggplot2)
v <- ggplot(twogr, aes(x=x, y=y,fill=gr,colour=gr))
v + geom_point(alpha=0.3) + coord_fixed()

A better plot for dense areas:

v + geom_hex(alpha=0.5, bins= 50) + coord_fixed()

v + geom_hex(alpha=0.4, bins= 50)+
annotate("path",
   x=1+1.3*cos(seq(0,2*pi,length.out=100)),
   y=1+1.3*sin(seq(0,2*pi,length.out=100)))+
annotate("path",
   x=-1+1.3*cos(seq(0,2*pi,length.out=100)),
   y=-1+1.3*sin(seq(0,2*pi,length.out=100)))+
annotate("text", x = -1, y = -1, label = "OTU  1")+   
annotate("text", x = 1, y = 1, label = "OTU  2")+    
coord_fixed()

The radius of attraction for the two sequences is different.

1. We can’t use a fixed radius for calling sequence cluster.
1. We need to do something about the errors to improve the calls.

The taxicab problem

$Amos Tversky and Danny Kahnneman$

Amos Tversky and Danny Kahnneman

Kanneman and Tversky: Heuristics and biases.

See Thinking Fast and Slow, D. Kannemean

Representativeness heuristic and base rate neglect.

Model using the frequencies in your data

Probability that it is a new sequence given the other information

We can use the data to look at error rates as we go along:

EM: expectation-maximization method.

Alternating method: pretend we know the errors.

Assign reads.
Use new reads to compute the error rates….etc..

Probability Model

s: ATTAACGAGATTATAACCAGAGTACGAATA...
     |              |
r: ATCAACGAGATTATAACAAGAGTACGAATA...

\[P(r|s)=\prod_{i=1}^L P(r(i)|s(i),q_r(i),Z)\] $P$ probabilities of substitutions ($A -> C$)

We estimate the ($A -> C$) error transitions from the data and these are output by the program and plotted.

$q$ Quality score (Q=30)

Batch effect (run)

Tutorials

Take away the .txt from the file name to obtain just the R file (you’ll need to download the data from the stamps server first):

STAMPS: DADA and Mixtures

In a run we expect different numbers of reads for each true strain

Different strains have different abundances

Runs of about 100,000 reads can expect 1,000 read-errors

What is the consequence of having different numbers of reads?

What do we notice?

A better plot for dense areas:

The radius of attraction for the two sequences is different.

The taxicab problem

Model using the frequencies in your data

Probability that it is a new sequence given the other information

Probability Model

Tutorials

Your turn:

Full stack tutorial run through:

The most beautiful workflow ever!

Benchmarking and Quality Control

Complete worflow BioC workshop

F1000Research paper link

Statistical background explaining EM, model etc