---
title: "STAMPS: DADA and Mixtures"
author: "Susan Holmes"
date: "August 1, 2018"
output:
slidy_presentation:
font_adjustment: +1
---
# In a run we expect different numbers of reads for each true strain
Even if the actual relative abundances were equal:
- **Techological** replicates have Poisson noise.
## Different strains have different abundances
The distribution of strains and species is very variable.
- Different bacteria are present in different abundances within a sample.
- Different samples will have different sample depths.
- Different samples will have different bacterial compositions.
```{r}
set.seed(1945)
##Suppose a small error rate for a sequence
techn=rpois(200,0.005)
table(techn)
mean(techn)
# About 0.01 per read
run=rpois(1e5,0.01)
table(run)
sim1K=replicate(1000,sum(rpois(1e5,0.01)))
mean(sim1K)
```
#Runs of about 100,000 reads can expect 1,000 read-errors
```{r}
hist(sim1K,50,col="lavender")
```
# What is the consequence of having different numbers of reads?
We'll need a little auxiliary package:
```{r, eval=FALSE}
install.packages("mixtools")
```
```{r,fig.width=5, fig.height=5}
library("mixtools")
S1=1e3
S2=1e5
seq1=rmvnorm(n=S1,mu=c(-1,-1),sigma=diag(c(0.5,0.5)))
seq2=rmvnorm(n=S2,mu=c(1,1),sigma=diag(c(0.5,0.5)))
twogr=data.frame(rbind(seq1,seq2),gr=factor(c(rep(1,S1),rep(2,S2))))
colnames(twogr)[1:2]=c("x","y")
plot(twogr[,1:2],col=twogr$gr)
```
```{r,fig.width=5, fig.height=5}
plot(twogr[,1:2],pch='.',col=twogr$gr)
```
# What do we notice?
(a nice picture if you want to use ggplot2)
```{r, fig.width=5, fig.height=4}
library(ggplot2)
v <- ggplot(twogr, aes(x=x, y=y,fill=gr,colour=gr))
v + geom_point(alpha=0.3) + coord_fixed()
```
# A better plot for dense areas:
```{r}
v + geom_hex(alpha=0.5, bins= 50) + coord_fixed()
```
```{r}
v + geom_hex(alpha=0.4, bins= 50)+
annotate("path",
x=1+1.3*cos(seq(0,2*pi,length.out=100)),
y=1+1.3*sin(seq(0,2*pi,length.out=100)))+
annotate("path",
x=-1+1.3*cos(seq(0,2*pi,length.out=100)),
y=-1+1.3*sin(seq(0,2*pi,length.out=100)))+
annotate("text", x = -1, y = -1, label = "OTU 1")+
annotate("text", x = 1, y = 1, label = "OTU 2")+
coord_fixed()
```
# The radius of attraction for the two sequences is different.
- 1) We can't use a fixed radius for calling sequence cluster.
- 2) We need to do something about the errors to improve the calls.
# The taxicab problem
![Amos Tversky and Danny Kahnneman](http://bios221.stanford.edu/images/amosdannyflowers.png){width="50%"}
Kanneman and Tversky: Heuristics and biases.
See [Thinking Fast and Slow, D. Kannemean](https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow)
Representativeness heuristic and base rate neglect.
# Model using the frequencies in your data
```{r dadaarrow, fig.width=6, fig.height=5,echo=FALSE,message=FALSE}
library(png)
library(grid)
img=readPNG("DADA_Arrows.png")
grid.raster(img)
```
# Probability that it is a new sequence given the other information
We can use the data to look at error rates as we go along:
EM: expectation-maximization method.
Alternating method: pretend we know the errors.
- Assign reads.
- Use new reads to compute the error rates....etc..
# Probability Model
```
s: ATTAACGAGATTATAACCAGAGTACGAATA...
| |
r: ATCAACGAGATTATAACAAGAGTACGAATA...
```
$$P(r|s)=\prod_{i=1}^L P(r(i)|s(i),q_r(i),Z)$$
$P$ probabilities of substitutions ($A -> C$)
We estimate the ($A -> C$) error transitions from the data and these are output by the program and plotted.
$q$ Quality score (Q=30)
Batch effect (run)
# Tutorials
Take away the .txt from the file name to obtain just the R file
(you'll need to download the data from the stamps server first):
### Your turn:
[DADA_Ex.R.txt](http://web.stanford.edu/class/bios221/stamps/DADA_Ex.R.txt)
### Full stack tutorial run through:
[Ben's full stack tutorial](http://benjjneb.github.io/dada2/tutorial.html)
### The most beautiful workflow ever!
[Mike Lee's workflow](https://astrobiomike.github.io/amplicon/dada2_workflow_ex)
### Benchmarking and Quality Control
[State of the Art (bjc)](http://benjjneb.github.io/dada2/SotA.html)
### Complete worflow [BioC workshop](http://web.stanford.edu/class/bios221/MicrobiomeWorkflowII.html)
### F1000Research paper [link](https://f1000research.com/articles/5-1492/v2)
# Statistical background explaining EM, model etc
[Book link:mixtures](http://web.stanford.edu/class/bios221/book/Chap-Mixtures.html)
[Book link:clustering](http://web.stanford.edu/class/bios221/book/Chap-Clustering.html)