--- title: "STAMPS: DADA and Mixtures" author: "Susan Holmes" date: "August 1, 2018" output: slidy_presentation: font_adjustment: +1 --- # In a run we expect different numbers of reads for each true strain Even if the actual relative abundances were equal: - **Techological** replicates have Poisson noise. ## Different strains have different abundances The distribution of strains and species is very variable. - Different bacteria are present in different abundances within a sample. - Different samples will have different sample depths. - Different samples will have different bacterial compositions. ```{r} set.seed(1945) ##Suppose a small error rate for a sequence techn=rpois(200,0.005) table(techn) mean(techn) # About 0.01 per read run=rpois(1e5,0.01) table(run) sim1K=replicate(1000,sum(rpois(1e5,0.01))) mean(sim1K) ``` #Runs of about 100,000 reads can expect 1,000 read-errors ```{r} hist(sim1K,50,col="lavender") ``` # What is the consequence of having different numbers of reads? We'll need a little auxiliary package: ```{r, eval=FALSE} install.packages("mixtools") ``` ```{r,fig.width=5, fig.height=5} library("mixtools") S1=1e3 S2=1e5 seq1=rmvnorm(n=S1,mu=c(-1,-1),sigma=diag(c(0.5,0.5))) seq2=rmvnorm(n=S2,mu=c(1,1),sigma=diag(c(0.5,0.5))) twogr=data.frame(rbind(seq1,seq2),gr=factor(c(rep(1,S1),rep(2,S2)))) colnames(twogr)[1:2]=c("x","y") plot(twogr[,1:2],col=twogr$gr) ``` ```{r,fig.width=5, fig.height=5} plot(twogr[,1:2],pch='.',col=twogr$gr) ``` # What do we notice? (a nice picture if you want to use ggplot2) ```{r, fig.width=5, fig.height=4} library(ggplot2) v <- ggplot(twogr, aes(x=x, y=y,fill=gr,colour=gr)) v + geom_point(alpha=0.3) + coord_fixed() ``` # A better plot for dense areas: ```{r} v + geom_hex(alpha=0.5, bins= 50) + coord_fixed() ``` ```{r} v + geom_hex(alpha=0.4, bins= 50)+ annotate("path", x=1+1.3*cos(seq(0,2*pi,length.out=100)), y=1+1.3*sin(seq(0,2*pi,length.out=100)))+ annotate("path", x=-1+1.3*cos(seq(0,2*pi,length.out=100)), y=-1+1.3*sin(seq(0,2*pi,length.out=100)))+ annotate("text", x = -1, y = -1, label = "OTU 1")+ annotate("text", x = 1, y = 1, label = "OTU 2")+ coord_fixed() ``` # The radius of attraction for the two sequences is different. - 1) We can't use a fixed radius for calling sequence cluster. - 2) We need to do something about the errors to improve the calls. # The taxicab problem ![Amos Tversky and Danny Kahnneman](http://bios221.stanford.edu/images/amosdannyflowers.png){width="50%"} Kanneman and Tversky: Heuristics and biases. See [Thinking Fast and Slow, D. Kannemean](https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow) Representativeness heuristic and base rate neglect. # Model using the frequencies in your data ```{r dadaarrow, fig.width=6, fig.height=5,echo=FALSE,message=FALSE} library(png) library(grid) img=readPNG("DADA_Arrows.png") grid.raster(img) ``` # Probability that it is a new sequence given the other information We can use the data to look at error rates as we go along: EM: expectation-maximization method. Alternating method: pretend we know the errors. - Assign reads. - Use new reads to compute the error rates....etc.. # Probability Model ``` s: ATTAACGAGATTATAACCAGAGTACGAATA... | | r: ATCAACGAGATTATAACAAGAGTACGAATA... ``` $$P(r|s)=\prod_{i=1}^L P(r(i)|s(i),q_r(i),Z)$$ $P$ probabilities of substitutions ($A -> C$) We estimate the ($A -> C$) error transitions from the data and these are output by the program and plotted. $q$ Quality score (Q=30) Batch effect (run) # Tutorials Take away the .txt from the file name to obtain just the R file (you'll need to download the data from the stamps server first): ### Your turn: [DADA_Ex.R.txt](http://web.stanford.edu/class/bios221/stamps/DADA_Ex.R.txt) ### Full stack tutorial run through: [Ben's full stack tutorial](http://benjjneb.github.io/dada2/tutorial.html) ### The most beautiful workflow ever! [Mike Lee's workflow](https://astrobiomike.github.io/amplicon/dada2_workflow_ex) ### Benchmarking and Quality Control [State of the Art (bjc)](http://benjjneb.github.io/dada2/SotA.html) ### Complete worflow [BioC workshop](http://web.stanford.edu/class/bios221/MicrobiomeWorkflowII.html) ### F1000Research paper [link](https://f1000research.com/articles/5-1492/v2) # Statistical background explaining EM, model etc [Book link:mixtures](http://web.stanford.edu/class/bios221/book/Chap-Mixtures.html) [Book link:clustering](http://web.stanford.edu/class/bios221/book/Chap-Clustering.html)