class: center, middle, inverse, title-slide # PLNmodels ## A collection of Poisson lognormal models
for multivariate analysis of count data ### J. Chiquet, M. Mariadassou, S. Robin
INRAE - Applied Mathematics and Informatics Division
Last update 06 May, 2021
###
https://pln-team.github.io/PLNmodels
---

# Reproducibility

## Package PLNmodels

Last stable release of **PLNmodels** on CRAN, development version available on GitHub.

```r
install.packages("PLNmodels")
devtools::install_github("jchiquet/PLNmodels")
```

```r
library(PLNmodels)
packageVersion("PLNmodels")
```

```
## [1] '0.11.5.9010'
```

## Help and documentation

The [PLNmodels website](https://pln-team.github.io/PLNmodels/) contains

- the standard package documentation
- a set of comprehensive vignettes for the top-level functions

all formatted with [**pkgdown**](https://pkgdown.r-lib.org)

---
class: inverse, center, middle

# 'Oaks powdery mildew' <br/> A motivating companion data set <br/> .small[See Jakuschkin, Fievet, Schwaller, Fort, Robin, and Vacher [Jak+16]]

---

# Generic form of data sets

Routinely gathered in ecology/microbiology/genomics

### Data tables

- .important[Abundances]: read counts of species/transcripts `\(j\)` in sample `\(i\)`
- .important[Covariates]: value of environmental variable `\(k\)` in sample `\(i\)`
- .important[Offsets]: sampling effort for species/transcripts `\(j\)` in sample `\(i\)`

### Need a framework to model _dependencies between counts_

- understand .important[environmental effects] <br/> `\(\rightsquigarrow\)` explanatory models (multivariate regression, classification)
- exhibit .important[patterns of diversity] <br/> `\(\rightsquigarrow\)` summarize the information (clustering, dimension reduction)
- understand .important[between-species interactions] <br /> `\(\rightsquigarrow\)` 'network' inference (variable/covariance selection)
- correct for technical and .important[confounding effects] <br/> `\(\rightsquigarrow\)` account for covariates and sampling effort

---

# Oaks powdery mildew data set overview

- .important[Microbial communities] sampled on the surface of `\(n = 116\)` oak leaves
- Communities sequenced and cleaned, resulting in `\(p=114\)` OTUs (66 bacteria, 48 fungi).
- Study .important[effects of the pathogen] _Erysiphe alphitoides_ with respect to the communities

The `oaks` variable consists of a special data frame, ready to use and typical of ecological data sets (try `?oaks` to get additional details).

<small>

```r
data("oaks")
str(oaks, max.level = 1)
```

```
## 'data.frame': 116 obs. of 13 variables:
##  $ Abundance   : int [1:116, 1:114] 0 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "dimnames")=List of 2
##  $ Sample      : Factor w/ 116 levels "A1.02","A1.03",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ tree        : Factor w/ 3 levels "susceptible",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ branch      : int 1 1 1 1 1 1 1 1 1 2 ...
##  $ leafNo      : int 2 3 4 5 6 7 8 9 10 11 ...
##  $ distTObase  : num 217 187 180 159 148 ...
##  $ distTOtrunk : int 202 175 168 148 138 136 115 88 79 198 ...
##  $ distTOground: num 156 144 142 134 130 ...
##  $ pmInfection : int 1 0 0 1 1 1 1 5 1 0 ...
##  $ orientation : Factor w/ 2 levels "NE","SW": 2 2 2 2 2 2 2 2 2 2 ...
##  $ readsTOTfun : int 2488 2054 2122 2625 2469 3156 2513 2083 2117 2522 ...
##  $ readsTOTbac : int 8315 662 480 674 643 3096 1347 2780 1346 1042 ...
##  $ Offset      : int [1:116, 1:114] 8315 662 480 674 643 3096 1347 2780 1346 1042 ...
##   ..- attr(*, "dimnames")=List of 2
```

</small>

---

# Covariates and offsets

Characterize the samples and the sampling, the most important being

- `tree`: Tree status with respect to the pathogen (susceptible, intermediate or resistant)
- `distTOground`: Distance of the sampled leaf to the ground
- `orientation`: Orientation of the branch (South-West SW or North-East NE)
- `readsTOTfun`: Total number of ITS1 reads for that leaf
- `readsTOTbac`: Total number of 16S reads for that leaf

```r
summary(oaks$tree)
```

```
##  susceptible intermediate    resistant 
##           39           38           39
```

```r
summary(oaks$distTOground)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    63.0   130.2   178.5   173.8   216.2   272.0
```

`\(\rightsquigarrow\)` `readsTOTfun` and `readsTOTbac` are candidates for modeling the sampling effort as offsets

---

# Abundance table (I)

```r
oaks$Abundance %>% as_tibble() %>% dplyr::select(1:10) %>% head() %>% knitr::kable(format = "html")
```

<table>
<thead>
<tr>
<th style="text-align:right;"> b_OTU_1045 </th> <th style="text-align:right;"> b_OTU_109 </th> <th style="text-align:right;"> b_OTU_1093 </th> <th style="text-align:right;"> b_OTU_11 </th> <th style="text-align:right;"> b_OTU_112 </th> <th style="text-align:right;"> b_OTU_1191 </th> <th style="text-align:right;"> b_OTU_1200 </th> <th style="text-align:right;"> b_OTU_123 </th> <th style="text-align:right;"> b_OTU_13 </th> <th style="text-align:right;"> b_OTU_1431 </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 146 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 68 </td> <td style="text-align:right;"> 0 </td>
</tr>
<tr>
<td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 1 </td>
</tr>
<tr>
<td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 128 </td> <td style="text-align:right;"> 0 </td>
</tr>
<tr>
<td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 121 </td> <td style="text-align:right;"> 1 </td>
</tr>
<tr>
<td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 113 </td> <td style="text-align:right;"> 0 </td>
</tr>
<tr>
<td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 77 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 90 </td> <td style="text-align:right;"> 57 </td>
</tr>
</tbody>
</table>

---

# Abundance table (II)

```r
log(1 + oaks$Abundance) %>% 
  corrplot::corrplot(is.corr = FALSE, addgrid.col = NA, tl.cex = .5, cl.pos = "n")
```

<!-- -->

<!-- PLN MODEL -->

---
class: inverse, center, middle

# Multivariate Poisson lognormal models <br/> .small[Statistical framework, inference, optimization]

---

# Models for multivariate count data

## If we were in a Gaussian world...

The .important[general linear model] [MKB79] would be appropriate! For each sample `\(i = 1,\dots,n\)`,

`$$\underbrace{\mathbf{Y}_i}_{\text{abundances}} = \underbrace{\mathbf{x}_i^\top \boldsymbol\Theta}_{\text{covariates}} + \underbrace{\mathbf{o}_i}_{\text{sampling effort}} + \boldsymbol\varepsilon_i, \quad \boldsymbol\varepsilon_i \sim \mathcal{N}(\mathbf{0}_p, \underbrace{\boldsymbol\Sigma}_{\text{between-species dependencies}})$$`

null covariance `\(\Leftrightarrow\)` independence `\(\rightsquigarrow\)` uncorrelated species/transcripts do not interact

`\(\rightsquigarrow\)` This model gives rise to Principal Component Analysis, Discriminant Analysis, Gaussian Graphical Models, Gaussian Mixture models and many others `\(\dots\)`

## With count data...

There is no generic model for multivariate counts

- Data transformation (log, `\(\sqrt{}\)`): quick and dirty
- Non-Gaussian multivariate distributions [Ino+17]: do not yet scale to the data dimension
- .important[Latent variable models]: interactions occur in a latent (unobserved) layer

---

# The Poisson Lognormal model (PLN)

The PLN model [AH89] is a .important[multivariate generalized linear model], where

- the counts `\(\mathbf{Y}_i\)` are the response variables
- the main effect is due to a linear combination of the covariates `\(\mathbf{x}_i\)`
- a vector of offsets `\(\mathbf{o}_i\)` can be specified for each sample.
$$ \mathbf{Y}_i | \mathbf{Z}_i \sim \mathcal{P}\left(\exp\{\mathbf{Z}_i\}\right), \qquad \mathbf{Z}_i \sim \mathcal{N}({\mathbf{o}_i + \mathbf{x}_i^\top\boldsymbol\Theta},\boldsymbol\Sigma) $$

.pull-left[
The unknown parameters are

- `\(\boldsymbol\Theta\)`, the regression parameters
- `\(\boldsymbol\Sigma\)`, the variance-covariance matrix
]

.pull-right[
Stacking all individuals together,

- `\(\mathbf{Y}\)` is the `\(n\times p\)` matrix of counts
- `\(\mathbf{X}\)` is the `\(n\times d\)` design matrix
- `\(\mathbf{O}\)` is the `\(n\times p\)` matrix of offsets
]

### Properties: .small[.important[over-dispersion, arbitrary-signed covariances]]

- mean: `\(\mathbb{E}(Y_{ij}) = \exp \left( o_{ij} + \mathbf{x}_i^\top {\boldsymbol\Theta}_{\cdot j} + \sigma_{jj}/2\right) > 0\)`
- variance: `\(\mathbb{V}(Y_{ij}) = \mathbb{E}(Y_{ij}) + \mathbb{E}(Y_{ij})^2 \left( e^{\sigma_{jj}} - 1 \right) > \mathbb{E}(Y_{ij})\)`
- covariance: `\(\mathrm{Cov}(Y_{ij}, Y_{ik}) = \mathbb{E}(Y_{ij}) \mathbb{E}(Y_{ik}) \left( e^{\sigma_{jk}} - 1 \right).\)`

---

# Geometrical view

<!-- -->

---

# Inference: .small[latent model but intractable EM]

Estimate `\(\theta = (\boldsymbol\Theta, \boldsymbol\Sigma)\)` and predict the `\(\mathbf{Z}_i\)`, while the marginal likelihood of the model is

`$$p_\theta(\mathbf{Y}_i) = \int_{\mathbb{R}^p} \prod_{j=1}^p p_\theta(Y_{ij} | Z_{ij}) \, p_\theta(\mathbf{Z}_i) \mathrm{d}\mathbf{Z}_i$$`

### Maximum likelihood for incomplete data models: EM

With `\(\mathcal{H}(p) = -\mathbb{E}_p(\log(p))\)` the entropy of `\(p\)`,

`$$\log p_\theta(\mathbf{Y}) = \mathbb{E}_{p_\theta(\mathbf{Z}\,|\,\mathbf{Y})} [\log p_\theta(\mathbf{Y}, \mathbf{Z})] + \mathcal{H}[p_\theta(\mathbf{Z}\,|\,\mathbf{Y})]$$`

EM requires evaluating (some moments of) `\(p_\theta(\mathbf{Z} \,|\, \mathbf{Y})\)`, but there is no closed form!

### Solutions

- [AH89] resorts to numerical integration; [Kar05] to Monte-Carlo integration
- Several heuristics, not always well motivated, are found in the literature...
- .important[Variational approach] [WJ08]: use a proxy of `\(p_\theta(\mathbf{Z}\,|\,\mathbf{Y})\)`.
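
As an aside, the integral above is easy to approximate by brute force for a single sample. The following minimal sketch (not part of **PLNmodels**; all parameter values are made up for illustration) estimates `\(p_\theta(\mathbf{Y}_i)\)` by Monte-Carlo integration over the latent layer, in the spirit of [Kar05]:

```r
# Monte-Carlo approximation of the PLN marginal likelihood of one count vector:
# p_theta(y) = E_Z[ prod_j dpois(y_j, exp(Z_j)) ] with Z ~ N(mu, Sigma).
# Toy values only -- this illustrates the integral, not the package's algorithm.
library(MASS)  # mvrnorm()

set.seed(1)
p     <- 3
mu    <- c(1, 0, 2)              # stands for o_i + x_i' Theta
Sigma <- 0.5 * diag(p) + 0.2     # some positive-definite latent covariance
y     <- c(4, 1, 12)             # one observed count vector

Z   <- mvrnorm(1e4, mu = mu, Sigma = Sigma)  # draws from the latent layer
lik <- mean(apply(Z, 1, function(z) prod(dpois(y, lambda = exp(z)))))
log(lik)  # Monte-Carlo estimate of log p_theta(y)
```

Such estimates become noisy and expensive as `\(p\)` grows, hence the variational route taken next.
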
---

# Variational approximation

.important[See Stéphane Robin's outstanding lecture]

## Principle

- Find a proxy of the conditional distribution `\(p(\mathbf{Z}\,|\,\mathbf{Y})\)`:

`$$q(\mathbf{Z}) \approx p_\theta(\mathbf{Z} | \mathbf{Y}).$$`

- Choose a convenient class of distributions `\(\mathcal{Q}\)` and minimize a divergence

`$$q(\mathbf{Z})^\star = \arg\min_{q\in\mathcal{Q}} D\left(q(\mathbf{Z}), p(\mathbf{Z} | \mathbf{Y})\right).$$`

## Popular choice

The Kullback-Leibler divergence .small[(error averaged wrt the approximating distribution)]

`$$KL\left(q(\mathbf{Z}), p(\mathbf{Z} | \mathbf{Y})\right) = \mathbb{E}_q\left[\log \frac{q(z)}{p(z\,|\,y)}\right] = \int_{\mathcal{Z}} q(z) \log \frac{q(z)}{p(z\,|\,y)} \mathrm{d}z.$$`

---

# Variational EM & PLN

## Class of distributions: diagonal multivariate Gaussian

`$$\mathcal{Q} = \Big\{q: \quad q(\mathbf{Z}) = \prod_i q_i(\mathbf{Z}_i), \quad q_i(\mathbf{Z}_i) = \mathcal{N}\left(\mathbf{Z}_i; \mathbf{m}_i, \mathrm{diag}(\mathbf{s}_i \circ \mathbf{s}_i)\right), \mathbf{m}_i, \mathbf{s}_i\in\mathbb{R}^p \Big\}$$`

Maximize the ELBO (Evidence Lower BOund):

`$$J(\theta, q) = \log p_\theta(\mathbf{Y}) - KL[q_\theta (\mathbf{Z}) || p_\theta(\mathbf{Z} | \mathbf{Y})] = \mathbb{E}_{q} [\log p_\theta(\mathbf{Y}, \mathbf{Z})] + \mathcal{H}[q(\mathbf{Z})]$$`

## Variational EM

- VE step: find the optimal `\(q\)` (here, `\(\{(\mathbf{m}_i, \mathbf{s}_i)\}_{i=1,\dots,n} = \{\mathbf{M}, \mathbf{S}\}\)`):

`$$q^h = \arg \max J(\theta^h, q) = \arg\min_{q \in \mathcal{Q}} KL[q(\mathbf{Z}) \,||\, p_{\theta^h}(\mathbf{Z}\,|\,\mathbf{Y})]$$`

- M step: update `\(\theta^h\)`

`$$\theta^h = \arg\max J(\theta, q^h) = \arg\max_{\theta} \mathbb{E}_{q} [\log p_{\theta}(\mathbf{Y}, \mathbf{Z})]$$`

---

# ELBO and gradients for PLN

Let `\(\mathbf{A} = \mathbb{E}_q[\exp(\mathbf{Z})] = \exp\left(\mathbf{O} + \mathbf{M} + \frac12 \mathbf{S}^2\right)\)`

### Variational bound

`$$\begin{array}{ll} J(\mathbf{Y}) & = \mathbf{1}_n^\intercal \left( \left[ \mathbf{Y} \circ (\mathbf{O} + \mathbf{M}) - \mathbf{A} + \frac12\log(\mathbf{S}^2)\right]\right) \mathbf{1}_{p} + \frac{n}2\log|{\boldsymbol\Omega}| \\ & - \frac12 \mathrm{trace}\left({\boldsymbol\Omega} \bigg[\left(\mathbf{M} - \mathbf{X}{\boldsymbol\Theta}\right)^\intercal \left(\mathbf{M} - \mathbf{X}{\boldsymbol\Theta}\right) + \mathrm{diag}(\mathbf{1}_n^\intercal\mathbf{S}^2)\bigg]\right) + \text{cst.}\\ \end{array}$$`

### M-step

`$$\hat{{\boldsymbol\Theta}} = \left(\mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{M}, \quad \hat{{\boldsymbol\Sigma}} = \frac{1}{n} \left(\mathbf{M}-\mathbf{X}\hat{{\boldsymbol\Theta}}\right)^\top \left(\mathbf{M}-\mathbf{X}\hat{\boldsymbol\Theta}\right) + \frac{1}{n} \mathrm{diag}(\mathbf{1}^\intercal\mathbf{S}^2)$$`

### Variational E-step

`$$\frac{\partial J(q)}{\partial \mathbf{M}} = \left(\mathbf{Y} - \mathbf{A} - (\mathbf{M} - \mathbf{X}{\boldsymbol\Theta}) \mathbf{\Omega}\right), \qquad \frac{\partial J(q)}{\partial \mathbf{S}} = \frac{1}{\mathbf{S}} - \mathbf{S} \circ \mathbf{A} - \mathbf{S} \mathrm{D}_{\boldsymbol\Omega} .$$`

---

# .small[Application to the optimization of PLN models]

## Property of the PLN variational approximation

The ELBO `\(J(\theta, q)\)` is bi-concave, i.e.

- concave wrt `\(q = (\mathbf{M}, \mathbf{S})\)` for given `\(\theta\)`
- concave wrt `\(\theta = (\boldsymbol\Sigma, \boldsymbol\Theta)\)` for given `\(q\)`

but .important[not jointly concave] in general.
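
To fix ideas, here is a small self-contained sketch (simulated data, intercept-only design; it mirrors the formulas above rather than the actual internals of **PLNmodels**) that applies the closed-form M-step and evaluates the bound `\(J\)` up to constants for a given set of variational parameters:

```r
# Closed-form M-step and ELBO (up to constants) for given variational (M, S).
# Simulated toy data, intercept-only design; illustrative only.
set.seed(1)
n <- 50; p <- 5
X  <- matrix(1, n, 1)                        # design: intercept only
O  <- matrix(0, n, p)                        # no offsets
Y  <- matrix(rpois(n * p, lambda = 5), n, p)
M  <- log(1 + Y)                             # crude variational means
S2 <- matrix(0.1, n, p)                      # variational variances S^2

## M-step given (M, S): regression coefficients and covariance
Theta <- solve(crossprod(X), crossprod(X, M))      # (X'X)^{-1} X' M
R     <- M - X %*% Theta
Sigma <- (crossprod(R) + diag(colSums(S2))) / n    # residual part + variational part
Omega <- solve(Sigma)

## Variational bound J, up to constants
A <- exp(O + M + S2 / 2)
J <- sum(Y * (O + M) - A + log(S2) / 2) +
  n / 2 * as.numeric(determinant(Omega)$modulus) -
  sum(diag(Omega %*% (crossprod(R) + diag(colSums(S2))))) / 2
J
```

In the package, the VE step then moves `\((\mathbf{M}, \mathbf{S})\)` along the gradients given above; they are kept fixed here for brevity.
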
## Optimization

Gradient ascent for the complete set of parameters<sup>1</sup> `\((\mathbf{M}, \mathbf{S}, \boldsymbol\Sigma, \boldsymbol\Theta)\)`

- **algorithm**: conservative convex separable approximations (Svanberg [Sva02]) <br/>
- **implementation**: the `NLopt` nonlinear-optimization library (Johnson [Joh11]) <br/>
- **initialization**: LM after log-transformation applied independently to each variable + concatenation of the regression coefficients + Pearson residuals

.footnote[[1] Alternating between variational and model parameters is useless here <br/> [2] Optimizing over `\(\mathbf{S}\)`, such that `\(\mathbf{S} \circ \mathbf{S} = \mathbf{S}^2\)` is the variance, avoids positivity constraints ]

---

# PLN: Natural extensions

### Various tasks of multivariate analysis

- .important[Classification]: maximize separation between groups with means

`$$\mathbf{Z}_i \sim \mathcal{N}({\boldsymbol\mu}_k \mathbf{1}_{\{i\in k\}}, \boldsymbol\Sigma), \quad \text{for known memberships}.$$`

- .important[Clustering]: mixture model in the latent space

`$$\mathbf{Z}_i \mid i \in k \sim \mathcal{N}(\boldsymbol\mu_k, \boldsymbol\Sigma_k), \quad \text{for unknown memberships}.$$`

- .important[Dimension Reduction]: rank constraint on the matrix `\(\boldsymbol\Sigma\)`.

`$$\mathbf{Z}_i \sim \mathcal{N}(\boldsymbol\mu, \boldsymbol\Sigma = \mathbf{B}\mathbf{B}^\top), \quad \mathbf{B} \in \mathcal{M}_{pk} \text{ with orthogonal columns}.$$`

- .important[Network inference]: sparsity constraint on the inverse covariance.

`$$\mathbf{Z}_i \sim \mathcal{N}(\boldsymbol\mu, \boldsymbol\Sigma = \boldsymbol\Omega^{-1}), \quad \|\boldsymbol\Omega \|_1 < c.$$`

### Challenge

.important[A variant] of the variational algorithm is required for each model

---
class: inverse, center, middle

# Multivariate Poisson Regression <br /> .small[Illustrating the oaks powdery mildew data set]

---

# The PLN function

The `PLN` function works in the same way as `lm`:

```r
PLN(formula = , # mandatory
    data    = , # highly recommended
    subset  ,   # optional
    weights ,   # optional
    control     # optional, mostly for advanced users
)
```

- `data` specifies where to look for the variables
- `formula` specifies the relationship between variables in `data` ( `\(\rightsquigarrow\)` _It builds the matrices_ `\(\mathbf{Y},\mathbf{X},\mathbf{O}\)`)
- `subset` is used for subsampling the observations; it should be a .important[full length] boolean vector, not a vector of indices / sample names
- `weights` is used to weight the observations,
- `control` is (mainly) used for tuning the optimization and should typically not be changed.

---

# Simple PLN models on oaks data

The simplest model we can imagine only has an intercept term:

```r
M00_oaks <- PLN(Abundance ~ 1, oaks)
```

`M00_oaks` is a particular `R` object with class `PLNfit` that comes with a couple of methods, helpfully listed when you print the object:

```r
M00_oaks
```

```
## A multivariate Poisson Lognormal fit with full covariance model.
## ==================================================================
##  nb_param    loglik       BIC       ICL
##      6669 -32333.38 -48184.23 -52268.19
## ==================================================================
## * Useful fields
##   $model_par, $latent, $latent_pos, $var_par, $optim_par
##   $loglik, $BIC, $ICL, $loglik_vec, $nb_param, $criteria
## * Useful S3 methods
##   print(), coef(), sigma(), vcov(), fitted(), predict(), standard_error()
```

---

# Accessing parameters

```r
coef(M00_oaks) %>% head() %>% t() %>% knitr::kable(format = "html")
```

<table>
<thead>
<tr>
<th style="text-align:left;"> </th> <th style="text-align:right;"> b_OTU_1045 </th> <th style="text-align:right;"> b_OTU_109 </th> <th style="text-align:right;"> b_OTU_1093 </th> <th style="text-align:right;"> b_OTU_11 </th> <th style="text-align:right;"> b_OTU_112 </th> <th style="text-align:right;"> b_OTU_1191 </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> -1.24894 </td> <td style="text-align:right;"> -1.07084 </td> <td style="text-align:right;"> -2.276925 </td> <td style="text-align:right;"> 2.717515 </td> <td style="text-align:right;"> -0.4412986 </td> <td style="text-align:right;"> 0.4977729 </td>
</tr>
</tbody>
</table>

.pull-left[
```r
sigma(M00_oaks) %>% corrplot(is.corr = FALSE, tl.cex = .5)
```
<!-- -->
]

.pull-right[
```r
sigma(M00_oaks) %>% cov2cor() %>% corrplot(tl.cex = .5)
```
<!-- -->
]

---

# Adding Offsets and covariates

## Offset: .small[modeling sampling effort]

The predefined offset uses the total sum of reads, accounting for the sequencing technologies specific to fungi and bacteria:

```r
M01_oaks <- PLN(Abundance ~ 1 + offset(log(Offset)), oaks)
```

## Covariates: .small[tree and orientation effects ('ANOVA'-like)]

The `tree` status is a natural candidate for explaining a part of the variance.

- We chose to describe the tree effect in the regression coefficients (mean),
- otherwise it would end up in the covariance as a possibly spurious effect on the between-species interactions.

```r
M11_oaks <- PLN(Abundance ~ 0 + tree + offset(log(Offset)), oaks)
```

What about adding more covariates in the model, e.g. the orientation?
```r
M21_oaks <- PLN(Abundance ~ 0 + tree + orientation + offset(log(Offset)), oaks)
```

---

# Adding Offsets and covariates (II)

There is a clear gain in introducing the tree covariate in the model:

```r
rbind(M00 = M00_oaks$criteria, M01 = M01_oaks$criteria,
      M11 = M11_oaks$criteria, M21 = M21_oaks$criteria) %>% 
  knitr::kable(format = "html")
```

<table>
<thead>
<tr>
<th style="text-align:left;"> </th> <th style="text-align:right;"> nb_param </th> <th style="text-align:right;"> loglik </th> <th style="text-align:right;"> BIC </th> <th style="text-align:right;"> ICL </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;"> M00 </td> <td style="text-align:right;"> 6669 </td> <td style="text-align:right;"> -32333.38 </td> <td style="text-align:right;"> -48184.23 </td> <td style="text-align:right;"> -52268.20 </td>
</tr>
<tr>
<td style="text-align:left;"> M01 </td> <td style="text-align:right;"> 6669 </td> <td style="text-align:right;"> -32231.82 </td> <td style="text-align:right;"> -48082.66 </td> <td style="text-align:right;"> -52196.08 </td>
</tr>
<tr>
<td style="text-align:left;"> M11 </td> <td style="text-align:right;"> 6897 </td> <td style="text-align:right;"> -31518.36 </td> <td style="text-align:right;"> -47911.12 </td> <td style="text-align:right;"> -51632.03 </td>
</tr>
<tr>
<td style="text-align:left;"> M21 </td> <td style="text-align:right;"> 7011 </td> <td style="text-align:right;"> -31450.18 </td> <td style="text-align:right;"> -48113.89 </td> <td style="text-align:right;"> -51738.89 </td>
</tr>
</tbody>
</table>

Looking at the coefficients `\(\mathbf{\Theta}\)` associated with `tree` brings additional insights:

<!-- -->

---

# Residual correlations

For models with offsets and the tree effect in the mean:

.pull-left[
```r
sigma(M01_oaks) %>% cov2cor() %>% corrplot(is.corr = FALSE, tl.cex = .5)
```
<!-- -->
]

.pull-right[
```r
sigma(M11_oaks) %>% cov2cor() %>% corrplot(tl.cex = .5)
```
<!-- -->
]

---
class: inverse, center, middle

# PLN: towards statistical properties and large-scale optimization

---

# Statistical Inference

## A first try: Wald test

Test `\(\mathcal{H}_0: R \theta = r_0\)` with the statistic

$$ (R \hat{\theta} - r_0)^\top \left[n R\hat{\mathbb{V}}(\hat{\theta}) R^\top \right]^{-1} (R \hat{\theta} - r_0) \sim \chi_k^2 \quad \text{where} \quad k = \text{rank}(R).$$

The Fisher information matrix

`$$I(\hat{\theta}) = - \mathbb{E}_\theta \left[ \frac{\partial^2 \log \ell(\theta; x)}{\partial \theta^2} \right]$$`

can be used as an approximation of `\(n\mathbb{V}(\hat{\theta})^{-1}\)`.

## Application

Derive confidence intervals for the inverse covariance `\(\mathbf{\Omega}\)` and the regression parameters `\(\mathbf{\Theta}\)`.

---

# Variational Wald-test

## Variational Fisher Information

The Fisher information matrix is given by

`$$I(\theta) = \begin{pmatrix} \frac{1}{n}(\mathbf{I}_p \otimes \mathbf{X}^\top)\mathrm{diag}(\mathrm{vec}(\mathbf{A}))(\mathbf{I}_p \otimes \mathbf{X}) & \mathbf{0} \\ \mathbf{0} & \frac12\mathbf{\Omega}^{-1} \otimes \mathbf{\Omega}^{-1} \end{pmatrix}$$`

and can be inverted blockwise to estimate `\(\mathbb{V}(\hat{\theta})\)`.
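
In practice these (variational) standard errors are exposed by the `standard_error()` method listed in the fit summaries above. Below is a hedged sketch of Wald-type intervals for the regression coefficients, assuming `standard_error()` returns a matrix of standard errors laid out like `coef()`:

```r
# Approximate 95% Wald confidence intervals for the regression coefficients
# of M11_oaks, based on the variational Fisher information.
# Assumption: standard_error() returns standard errors shaped like coef().
est <- coef(M11_oaks)
se  <- standard_error(M11_oaks)
ci  <- list(lower = est - qnorm(0.975) * se,
            upper = est + qnorm(0.975) * se)
str(ci, max.level = 1)
```

As the coverage study on the next slides shows, the variational variance is underestimated, so these intervals should be taken as indicative only.
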
## Wald test and coverage

- `\(\hat{\mathbb{V}}(\mathbf{\Theta}_{kj}) = [n (\mathbf{X}^\top \mathrm{diag}(\mathrm{vec}(\hat{A}_{.j})) \mathbf{X})^{-1}]_{kk}\)`
- `\(\hat{\mathbb{V}}(\Omega_{kl}) = 2\Omega_{kk}\Omega_{ll}\)`

The confidence intervals at level `\(\alpha\)` are given by

- `\(B_{kj} = \hat{B}_{kj} \pm \frac{q_{1 - \alpha/2}}{\sqrt{n}} \sqrt{\hat{\mathbb{V}}(\mathbf{\Theta}_{kj})}\)`
- `\(\Omega_{kl} = \hat{\Omega}_{kl} \pm \frac{q_{1 - \alpha/2}}{\sqrt{n}} \sqrt{\hat{\mathbb{V}}(\Omega_{kl})}\)`.

---

# Numerical study

## Study

Bias and coverage of the estimators:

- number of samples `\(n \in \{50, 100, 500, 1000, 10000\}\)`
- number of species/genes `\(p \in \{20, 200\}\)`
- number of covariates `\(d \in \{2, 5, 10\}\)`
- sampling effort `\(TSS \in \{\text{low}, \text{medium}, \text{high}\}\)` ($\approx 10^4$, `\(10^5\)` and `\(10^6\)`)
- noise level `\(\sigma^2 \in \{0.2, 0.5, 1, 2\}\)`
- `\(\boldsymbol\Sigma\)` as `\(\sigma_{jk} = \sigma^2 \rho^{|j-k|}\)`, with `\(\rho = 0.2\)`
- `\(\mathbf{\Theta}\)` with entries sampled from `\(\mathcal{N}(0,1/d)\)`

---

# Bias of `\(\hat{\boldsymbol\Theta}\)`

`\(\rightsquigarrow\)` .important[asymptotically unbiased]

---

# Root mean square error of `\(\hat{\boldsymbol\Theta}\)`

`\(\rightsquigarrow\)` .important[asymptotically unbiased]

---

# 95% confidence interval - Coverage

.important[variance underestimated], no trustworthy confidence intervals can be derived out of the box

---

# Other ideas

## Louis Formula

Alternate form of the Fisher information matrix: same results...

## LR test, Rao test

Same results

## M-estimation

We can derive asymptotic behavior ...

... but the VEM stationary point is .important[not a log-likelihood stationary point], see [WM15]

`\(\rightsquigarrow\)` Sandwich estimator, correction... next try!

---

# Other tracks for optimization

.important[Bastien Batardière's MSc internship] co-supervised with Joon Kwon and Laure Sansonnet

## Use tools from deep learning

Keep on optimizing the variational criterion by relying on

- tools for automatic differentiation
- stochastic-gradient descent variants
- GPU

`\(\rightsquigarrow\)` first proof of concept with **PyTorch** for PLN/PLNPCA

## Alternate objective functions

- Other variational approximations
- Direct optimization of the log-likelihood via gradient approximation

---

# PLN model: alternate objectives (I)

## Alternate formulation (inspired by PLN-PCA)

`$$\begin{align*} W_i&\sim \mathcal{N}(0,I_p),\text{ iid}, \quad i=1,\dots,n\\ Z_i&=\mathbf{\Theta}\mathbf{x}_i+\mathbf{C} W_i,\quad i\in 1,\dots,n,\\ Y_{ij}|Z_{ij}&\sim \mathcal{P}(\exp \left( o_{ij}+Z_{ij} \right)).
\end{align*}$$`

## Variational approximation on `\(W\)`

We now use `\(q(\mathbf{W}) \approx p_\theta(\mathbf{W} | \mathbf{Y})\)`, with `\(q(\mathbf{W}_i) \sim \mathcal{N}(\mathbf{m}_i, \mathrm{diag}(\mathbf{s}_i \circ \mathbf{s}_i))\)`

We found

$$ `\begin{gather*} J(\mathbf{Y}) = \mathbf{1}_n^\intercal \left( \mathbf{Y} \circ \left[\mathbf{O} + \mathbf{X}\mathbf{\Theta} + \mathbf{M}\mathbf{C}^\top\right] - \mathbf{A} - \frac12 \left[\mathbf{M}^2 + \mathbf{S}^2 - \log(\mathbf{S}^2)\right] \right) \mathbf{1}_{p} + \text{cst.}\\ \mathrm{with} \quad \mathbf{A} = \exp(\mathbf{O} + \mathbf{X}\mathbf{\Theta} + \mathbf{M}\mathbf{C}^\top + \mathbf{S}^2 (\mathbf{C}^2)^\top ) \end{gather*}` $$

`\(\rightsquigarrow\)` Direct gradient ascent on `\((\mathbf{C}, \mathbf{\Theta}, \mathbf{M}, \mathbf{S})\)`

---

# PLN model: alternate objectives (II)

## Target log-likelihood

The log-likelihood of an observation `\(Y_i\)` `\((i\in [n])\)` reads:

`$$\log p_{\theta}(Y_i)=\log \mathbb{E}_{W\sim \mathcal{N}(0,I_q)}\left[ \exp \left( \sum_{j=1}^{p}\quad \dots \quad \right) \right]$$`

Function to minimize:

`$$F(\theta)=-\frac{1}{n}\sum_{i=1}^n\log p_{\theta}(Y_i).$$`

- Optimize the log-likelihood directly
- Statistical guarantees easier to derive

`\(\rightsquigarrow\)` How?

---

# SGD + importance sampling

## Gradient Descent

`$$\theta_{t+1}=\theta_t-\gamma_t\nabla F(\theta_t)$$`

where

`$$\nabla F(\theta)=-\frac{1}{n}\sum_{i=1}^n\frac{\nabla_{\theta} p_{\theta}(Y_i)}{p_{\theta}(Y_i)} =-\frac{1}{n}\sum_{i=1}^n\frac{\mathbb{E}_{W\sim \mathcal{N}(0,I_q)}\left[ \nabla_\theta\exp \left( \sum_{j=1}^{p}\left( \quad \dots \quad \right) \right) \right] }{\mathbb{E}_{W\sim \mathcal{N}(0,I_q)}\left[ \exp \left( \sum_{j=1}^{p}\left( \quad \dots \quad \right) \right) \right] }$$`

## Stochastic Gradient Descent

We replace `\(\nabla F(\theta_t)\)` by an unbiased estimator `\(\hat{g}_t\)` to get SGD:

`$$\theta_{t+1}=\theta_t-\gamma_t\hat{g}_t,\qquad \text{where}\quad \mathbb{E}\left[ \hat{g}_t\mid \theta_t \right] =\nabla F(\theta_t).$$`

The estimator `\(\hat{g}_t\)` can be constructed by

- sampling values of `\(W\)`
- possibly sampling a subset `\(I\subset [n]\)` of a given cardinality.

---

# Adaptive algorithms

## Refinements of SGD

- AdaGrad (2011) uses adaptive coordinate-wise step-sizes:

`$$\theta_{t+1} = \theta_t-\frac{\gamma}{\sqrt{\varepsilon+G_t}} \odot \hat{g}_t \qquad \text{where} \quad G_t = \sum_{s=1}^{t}\hat{g}_s^{\odot 2}.$$`

- RMSProp (2012) adds momentum to the step-sizes:

`$$\theta_{t+1}=\theta_t-\frac{\gamma}{\sqrt{\varepsilon+G_t}}\odot \hat{g}_t\qquad \text{where}\quad G_t=\alpha G_{t-1}+(1-\alpha)\hat{g}_t^{\odot 2}.$$`

- Adam (2015) also adds momentum to the gradients:

`$$\theta_{t+1}=\theta_t-\frac{\gamma}{\sqrt{\varepsilon+G_t}}\odot m_t\qquad \text{where}\quad m_t=\beta m_{t-1}+(1-\beta)\hat{g}_t.$$`

<br />

`\(\rightsquigarrow\)` All available in **PyTorch** with auto-differentiation.
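
As a toy illustration of these updates (independent of the PLN likelihood and of **PyTorch**; step size and momentum constants are arbitrary), here is an Adam-flavoured SGD loop in base R on a simple quadratic objective with noisy gradients:

```r
# Adam-style update (as on this slide, without bias correction) applied to
# F(theta) = ||theta - theta_star||^2 / 2, with noisy gradient estimates
# standing in for the sampled gradients described above.
set.seed(1)
theta_star <- c(2, -1, 0.5)
theta <- c(0, 0, 0)
gamma <- 0.05; beta <- 0.9; alpha <- 0.999; eps <- 1e-8
m <- G <- c(0, 0, 0)
for (t in 1:2000) {
  g_hat <- (theta - theta_star) + rnorm(3, sd = 0.2)  # unbiased, noisy gradient
  m     <- beta  * m + (1 - beta)  * g_hat            # momentum on gradients
  G     <- alpha * G + (1 - alpha) * g_hat^2          # momentum on squared gradients
  theta <- theta - gamma / sqrt(eps + G) * m          # coordinate-wise adaptive step
}
round(theta, 2)  # should end up close to theta_star
```

The same loop applies once `\(\hat{g}_t\)` is built from sampled values of `\(W\)` (and possibly a mini-batch of samples), as on the previous slide.
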
---
class: inverse, center, middle

# Classification for counts <br /> PLN discriminant analysis

---

# Poisson Linear Discriminant Analysis

- a PLN model with a known discrete structure of `\(K\)` groups
- group-specific main effect `\({\boldsymbol\mu}_k \in\mathbb{R}^p\)`, covariance matrix `\(\boldsymbol{\Sigma}\)` is shared among groups

`$$\begin{array}{rcl} \text{latent space } & \mathbf{Z}_i \sim \mathcal{N}(\mathbf{o}_i + \mathbf{x}_i^\top\boldsymbol\Theta + {\boldsymbol\mu}_k \mathbf{1}_{\{i\in k\}},\boldsymbol\Sigma) \\ \text{observation space } & \mathbf{Y}_i | \mathbf{Z}_i \sim \mathcal{P}\left(\exp\{\mathbf{Z}_i\}\right). \end{array}$$`

### Goal of LDA

Find the linear combinations `\(\mathbf{Z}\mathbf{v}, \mathbf{v}\in\mathbb{R}^p\)` maximizing separation between groups

<img src="slides_files/figure-html/PLN_geom_lda_no_offset-1.png" style="display: block; margin: auto;" />

---

# Poisson LDA: mimic the Gaussian case

## Solution

1. Adjust a "standard" PLN with `\(\mathbf{X} \rightarrow (\mathbf{X}, \mathbf{1}_{\{i\in k\}})\)`, `\(\boldsymbol\Theta \rightarrow (\boldsymbol\Theta, \mathbf{U} = ({\boldsymbol\mu}_1, \dots, {\boldsymbol\mu}_K))\)`
2. Use the estimates of `\(\boldsymbol\Sigma\)`, `\(\mathbf{U}\)` and `\(\boldsymbol\Theta\)` and the `\(\tilde{\mathbf{Z}}_i\)` to compute

`$$\hat{\boldsymbol\Sigma}_{\text{between}} = \frac1{K-1} \sum_k n_k (\hat{\boldsymbol\mu}_k - \hat{\boldsymbol\mu}_\bullet) (\hat{\boldsymbol\mu}_k - \hat{\boldsymbol\mu}_\bullet)^\intercal$$`

3. Compute the first `\(K-1\)` eigenvectors of `\(\hat{\boldsymbol\Sigma}^{-1} \hat{\boldsymbol\Sigma}_{\text{between}} = \mathbf{P}^\top \Lambda \mathbf{P}\)` (discriminant axes)

- **Graphical representation**:
  - Center the estimated latent positions `\(\tilde{Z} = \mathbb{E}_q [\mathbf{Z}] - \mathbf{o}_i - \mathbf{x}_i^\top {\boldsymbol\Theta}\)`
  - Represent `\(\tilde{Z}^\text{LDA} = \tilde{Z} \mathbf{P} \Lambda^{1/2}\)`, the coordinates along the discriminant axes
- **Prediction**: For each group,
  - Compute the (variational) likelihood `\(p_k = P(\mathbf{Y}_{\text{new}} | \hat{\boldsymbol\Sigma}, \hat{\boldsymbol\Theta}, \hat{\boldsymbol\mu}_k)\)`
  - Assign `\(i\)` to the group with the highest posterior probability `\(\pi_k \propto \frac{n_k p_k}{n}\)`

---

# LDA of the oaks data set

Use the `tree` variable for grouping (`grouping` is a factor giving the group of each sample):

```r
myLDA_tree <- PLNLDA(Abundance ~ 1 + offset(log(Offset)), grouping = tree, data = oaks)
```

No need for model selection!
<small>

```
## Linear Discriminant Analysis for Poisson Lognormal distribution
## ==================================================================
##  nb_param    loglik       BIC       ICL
##       570 -31518.36 -32873.14 -36594.05
## ==================================================================
## * Useful fields
##   $model_par, $latent, $latent_pos, $var_par, $optim_par
##   $loglik, $BIC, $ICL, $loglik_vec, $nb_param, $criteria
## * Useful S3 methods
##   print(), coef(), sigma(), vcov(), fitted(), predict(), standard_error()
## * Additional fields for LDA
##   $percent_var, $corr_map, $scores, $group_means
## * Additional S3 methods for LDA
##   plot.PLNLDAfit(), predict.PLNLDAfit()
```

</small>

---

# LDA for oaks: covariance model

PLN-LDA can account for various models of the covariance (spherical, diagonal, full):

```r
LDA_oaks1 <- PLNLDA(Abundance ~ 1 + offset(log(Offset)),
                    data = oaks, grouping = tree,
                    control = list(covariance = "full"))
LDA_oaks2 <- PLNLDA(Abundance ~ 1 + offset(log(Offset)),
                    data = oaks, grouping = tree,
                    control = list(covariance = "diagonal"))
```

.pull-left[
<!-- -->
]

.pull-right[
<!-- -->
]

---

# LDA for oaks: prediction

If abundance data for new samples are available, their group can be predicted using the `predict` function. We illustrate this on our sample<sup>1</sup> and compare the predicted tree statuses to the actual ones:

```r
predictions <- predict(LDA_oaks1, newdata = oaks, type = "response")
table(predictions, oaks$tree)
```

```
##               
## predictions    susceptible intermediate resistant
##   susceptible           39            0         0
##   intermediate           0           38         0
##   resistant              0            0        39
```

.footnote[ [1] Predicting the tree status of samples used to train the model is bad practice in general and prone to overfitting; we do it for illustrative purposes only. ]

---
class: inverse, center, middle

# Clustering for counts <br /> .small[PLN mixture model]

---

# Multivariate Poisson Mixture models

- a PLN model with an additional layer assuming that the latent observations are drawn from a mixture of `\(K\)` multivariate Gaussian components.
- Each component `\(k\)` has a prior probability `\(p(i \in k) = \pi_k\)` such that `\(\sum_k \pi_k = 1\)`.

Let `\(C_i\in \{1,\dots,K\}\)` be the multinomial variable describing the component of `\(i\)`, then

`$$\begin{array}{rrcl} \text{latent layer 1:} & C_i & \sim & \mathcal{M}(1,\pi = (\pi_1,\dots,\pi_K)), \\ \text{latent layer 2:} & \mathbf{Z}_i \mid C_i = k & \sim & \mathcal{N}({\bar{\boldsymbol\mu}}_k + \mathbf{o}_i + \mathbf{x}_i^\top\boldsymbol\Theta, \boldsymbol\Sigma_k), \quad k = 1,\dots,K. \\ \text{observation space } & \mathbf{Y}_i | \mathbf{Z}_i & \sim & \mathcal{P}\left(\exp\{\mathbf{Z}_i\}\right).
\end{array}$$`

The unknown parameters are

- `\(\boldsymbol\Theta\)`, the matrix of regression parameters
- `\({\boldsymbol\mu}_k\)`, the vector of means in group `\(k\)`
- `\({\boldsymbol\Sigma}_k\)`, the variance-covariance matrix in group `\(k\)`
- `\({\boldsymbol\pi}\)`, the vector of mixture proportions
- `\(C_i\)`, the group membership of sample `\(i\)`, estimated by the additional variational parameters `\(\tau_{ik} = \hat{\mathbb{P}}(C_i = k|\mathbf{Y}_i)\)` of the mixture

---

# PLN mixture: additional details

## Parametrization of the covariance matrices

When `\(p\)` is large, general forms of `\({\boldsymbol\Sigma}_k\)` lead to a prohibitive number of parameters

`\(\rightsquigarrow\)` We include constrained parametrizations of the covariance matrices to reduce the computational burden and avoid over-fitting:

`$$\begin{array}{rrcll} \text{no restriction:} & \boldsymbol\Sigma_k & = & \text{symmetric} & \text{($ K p (p+1)/2$ parameters),} \\ \text{diagonal covariances:} & \boldsymbol\Sigma_k & = &\mathrm{diag}({d}_k) & \text{($2 K p$ parameters),} \\ \text{spherical covariances:} & \boldsymbol\Sigma_k & = & \sigma_k^2 {I} & \text{($K (p + 1)$ parameters).} \\ \end{array}$$`

## Features

- **Optimization**: _"Similar"_ variational framework
  - a weighted sum of PLN fits with .important[weights given by the posterior probabilities] `\(\tau_{ik}\)`
- **Model selection**: variational BIC/ICL, which need restarts and smoothing
  - `\(\tilde{\text{BIC}}_K = J(\theta, K) - \frac12 \log(n) \mathrm{\#param}\)`
  - `\(\tilde{\text{ICL}}_K = \tilde{\text{BIC}}_K - \mathcal{H}_{\mathcal{N}}(K) - \mathcal{H}_{\mathcal{M}}(K)\)` (two layers)

---

# Clustering of the oaks samples

```r
PLN_mixtures <- PLNmixture(Abundance ~ 1 + offset(log(Offset)), data = oaks, clusters = 1:5)
```

The output is of class `PLNmixturefamily`. It is a collection of `R` objects:

```r
PLN_mixtures
```

```
## --------------------------------------------------------
## COLLECTION OF 5 POISSON LOGNORMAL MODELS
## --------------------------------------------------------
## Task: Mixture Model
## ========================================================
##  - Number of clusters considered: from 1 to 5
##  - Best model (regarding BIC): cluster = 5
##  - Best model (regarding ICL): cluster = 5
```

`PLNmixturefamily` has three methods: `plot`, `getModel`, `getBestModel`<sup>1</sup>

.footnote[[1] Additional help can be found with `?PLNmixturefamily`, `?getBestModel.PLNmixturefamily`, `?plot.PLNmixturefamily`]

---

# Clustering analysis: model selection (I)

The plot function gives you hints about the "right" number of clusters for your data

```r
plot(PLN_mixtures, criteria = c("loglik", "ICL", "BIC"), reverse = TRUE)
```

<!-- -->

---

# Clustering analysis: model selection (II)

To extract a particular model from the collection, use `getModel` (or `getBestModel`):

```r
myPLN_mix <- getModel(PLN_mixtures, 3)
```

The extracted object has class `PLNmixturefit`. It contains various pieces of information plus `\(K\)` components with class `PLNfit`.

<small>

```r
myPLN_mix
```

```
## Poisson Lognormal mixture model with 3 components and spherical covariances.
## * Useful fields
##   $posteriorProb, $memberships, $mixtureParam, $group_means
##   $model_par, $latent, $latent_pos, $optim_par
##   $loglik, $BIC, $ICL, $loglik_vec, $nb_param, $criteria
##   $component[[i]] (a PLNfit with associated methods and fields)
## * Useful S3 methods
##   print(), coef(), sigma(), fitted(), predict()
```

```r
myPLN_mix$components[[1]]
```

```
## A multivariate Poisson Lognormal fit with spherical covariance model.
## ==================================================================
##  nb_param    loglik       BIC       ICL
##       115 -14705.56 -14978.89 -16879.44
## ==================================================================
## * Useful fields
##   $model_par, $latent, $latent_pos, $var_par, $optim_par
##   $loglik, $BIC, $ICL, $loglik_vec, $nb_param, $criteria
## * Useful S3 methods
##   print(), coef(), sigma(), vcov(), fitted(), predict(), standard_error()
```

</small>

---

# Clustering analysis: model exploration and visualization

.pull-left[
```r
myPLN_mix$plot_clustering_pca()
```
<!-- -->
]

.pull-right[
```r
myPLN_mix$plot_clustering_data()
```
<!-- -->
]

---

# Clustering analysis: validation?

<img src="slides_files/figure-html/PLN clustering ARI-1.png" style="display: block; margin: auto;" />

<small>

```r
table(myPLN_mix$memberships, oaks$tree)
```

```
##    
##     susceptible intermediate resistant
##   1           0            0        39
##   2          39            0         0
##   3           0           38         0
```

</small>

---

# Clustering on corrected data I

```r
PLN_mixtures_tree <- PLNmixture(Abundance ~ 0 + tree + distTOground + offset(log(Offset)),
                                data = oaks, clusters = 1:5)
```

```r
plot(PLN_mixtures_tree, criteria = c("loglik", "BIC"), reverse = TRUE)
```

<!-- -->

---

# Clustering on corrected data II

.pull-left[
```r
myPLN_mix_cov$plot_clustering_pca()
```
<!-- -->
]

.pull-right[
```r
myPLN_mix_cov$plot_clustering_data()
```
<!-- -->
]

---
class: inverse, center, middle

# Dimension reduction and visualization for counts <br/> .small[See Chiquet, Mariadassou, and Robin [CMR18]]

---

# Poisson Lognormal model for PCA

The PLN-PCA [CMR18] model implemented in *PLNmodels* can be viewed as a PLN model with an additional rank constraint on the covariance matrix `\(\boldsymbol\Sigma\)` such that `\(\mathrm{rank}(\boldsymbol\Sigma)= q\)`:

`$$\begin{array}{rcl} \text{latent space } & \mathbf{Z}_i \sim \mathcal{N}(\mathbf{o}_i + \mathbf{x}_i^\top\boldsymbol\Theta,\boldsymbol\Sigma), & \boldsymbol\Sigma = \mathbf{B}\mathbf{B}^\top, \quad \mathbf{B}\in\mathcal{M}_{pq} \\ \text{observation space } & Y_{ij} | Z_{ij} \quad \text{indep.} & Y_{ij} | Z_{ij} \sim \mathcal{P}\left(\exp\{Z_{ij}\}\right), \end{array}$$`

The dimension `\(q\)` of the latent space corresponds to the number of axes in the PCA or, in other words, to the rank of `\(\boldsymbol\Sigma = \mathbf{B}\mathbf{B}^\intercal\)`.
The unknown parameters are `\(\boldsymbol\Theta\)` and `\(\mathbf{B}\)`, the matrix of .important[_rescaled loadings_]

### Features

- **Optimization**: _"Similar"_ variational framework (with different gradients)
- **Model selection**: variational BIC/ICL
  - `\(\tilde{\text{BIC}}_q = J(\theta, q) - \frac12 \log(n) \left(p (d + q) - q(q-1)/2\right)\)`
  - `\(\tilde{\text{ICL}}_q = \tilde{\text{BIC}}_q - \mathcal{H}(q)\)`
- **Visualization:** PCA on the expected latent positions `\(\mathbb{E}_{q}(\mathbf{Z}_i) - \mathbf{o}_i - \mathbf{x}_i^\top\boldsymbol\Theta = \mathbf{M}\hat{\mathbf{B}}^\top\)`

---

# A PCA analysis of the oaks data set

Let us fit PLNPCA with the specification of our best model up to now (TSS as offsets):

```r
PCA_offset <- PLNPCA(Abundance ~ 1 + offset(log(Offset)), data = oaks, ranks = 1:30)
```

The output is of class `PLNPCAfamily`. It is a collection of `R` objects:

```r
PCA_offset
```

```
## --------------------------------------------------------
## COLLECTION OF 30 POISSON LOGNORMAL MODELS
## --------------------------------------------------------
## Task: Principal Component Analysis
## ========================================================
##  - Ranks considered: from 1 to 30
##  - Best model (greater BIC): rank = 29
##  - Best model (greater ICL): rank = 29
```

`PLNPCAfamily` has three methods: `plot`, `getModel`, `getBestModel`<sup>1</sup>

.footnote[[1] Additional help can be found with `?PLNPCAfamily`, `?getBestModel.PLNPCAfamily`, `?plot.PLNPCAfamily`]

---

# PCA analysis: model selection (I)

The plot function gives you hints about the "right" rank/subspace size for your data

```r
plot(PCA_offset, reverse = TRUE)
```

<!-- -->

---

# PCA analysis: model selection (II)

To extract a particular model from the collection, use `getBestModel`:

```r
PCA_offset_BIC <- getBestModel(PCA_offset, "BIC")
```

The extracted object has class `PLNPCAfit`.
It inherits from the `PLNfit` class but with additional methods due to its `PCA` nature: when printing `PCA_offset_BIC`, we get

```
## Poisson Lognormal with rank constrained for PCA (rank = 29)
## ==================================================================
##  nb_param    loglik       BIC       ICL
##      3014 -26883.67 -34047.33 -31260.38
## ==================================================================
## * Useful fields
##   $model_par, $latent, $latent_pos, $var_par, $optim_par
##   $loglik, $BIC, $ICL, $loglik_vec, $nb_param, $criteria
## * Useful S3 methods
##   print(), coef(), sigma(), vcov(), fitted(), predict(), standard_error()
## * Additional fields for PCA
##   $percent_var, $corr_circle, $scores, $rotation, $eig, $var, $ind
## * Additional S3 methods for PCA
##   plot.PLNPCAfit()
```

---

# PCA analysis: model exploration

Inheritance allows you to rely on the same methods as with `PLN`:

.pull-left[
```r
corrplot(cov2cor(sigma(M01_oaks)), tl.cex = .5)
```
<!-- -->
]

.pull-right[
```r
corrplot(cov2cor(sigma(PCA_offset_BIC)), tl.cex = .5)
```
<!-- -->
]

---

# PCA: visualization

<small>

```r
factoextra::fviz_pca_biplot(
  PCA_offset_BIC, select.var = list(contrib = 10), addEllipses = TRUE,
  habillage = oaks$tree,
  title = "Biplot (10 most contributing species)"
) + labs(col = "tree status") + scale_color_viridis_d()
```

<img src="slides_files/figure-html/PCA offset vizu tree-1.png" style="display: block; margin: auto;" />

</small>

---

# PCA: removing covariate effects

To hopefully find some hidden effects in the data, we can try to remove confounding ones:

```r
PCA_tree <- PLNPCA(Abundance ~ 0 + tree + offset(log(Offset)), data = oaks, ranks = 1:30)
```

<img src="slides_files/figure-html/PCA covariate tree plot-1.png" style="display: block; margin: auto;" />

---
class: center, middle, inverse

# Sparse structure estimation for counts <br/> .small[See Chiquet, Mariadassou, and Robin [CMR19]]

---

# .small[Background on Gaussian Graphical Models]

Suppose `\(\mathbf{Z} \sim \mathcal{N}(\mathbf{0}, {\boldsymbol\Omega}^{-1}=\boldsymbol\Sigma)\)`

### Conditional independence structure

`$$(j,k) \notin \mathcal{E} \Leftrightarrow Z_j \perp Z_k | Z_{\backslash \{j,k\}} \Leftrightarrow {\boldsymbol\Omega}_{jk} = 0.$$`

.pull-left[ `\(\mathcal{G}=(\mathcal{P},\mathcal{E})\)` ]

.pull-right[ `\(\boldsymbol\Omega\)` ]

### Graphical-Lasso

Network reconstruction is (roughly) a variable selection problem

`$$\hat{{\boldsymbol\Omega}}_\lambda=\arg\max_{\boldsymbol\Omega \in \mathbb{S}_+} \ell({\boldsymbol\Omega};\mathbf{Y})-\lambda \|{\boldsymbol\Omega}\|_{1}$$`

---

# Sparse precision for multivariate counts

The PLN-network model adds a sparsity constraint on the precision matrix `\({\boldsymbol\Sigma}^{-1}\triangleq \boldsymbol\Omega\)`:

`$$\begin{array}{rcl} \text{latent space } & \mathbf{Z}_i \sim \mathcal{N}\left({\mathbf{o}_i + \mathbf{x}_i^\top\boldsymbol\Theta},\boldsymbol\Omega^{-1}\right) & \|\boldsymbol\Omega\|_1 < c \\ \text{observation space } & \mathbf{Y}_i | \mathbf{Z}_i \sim \mathcal{P}\left(\exp\{\mathbf{Z}_i\}\right) \end{array}$$`

`\(\rightsquigarrow\)` The `\(\ell_1\)`-penalty induces selection of direct relations (an underlying network)

## .small[Variational approximation]

`$$J(\theta, q) - \lambda \| \boldsymbol\Omega\|_{1,\text{off}} = \mathbb{E}_{q} [\log p_\theta(\mathbf{Y}, \mathbf{Z})] + \mathcal{H}[q(\mathbf{Z})] - \lambda \|\boldsymbol\Omega\|_{1, \text{off}}$$`

Still bi-concave in `\((\boldsymbol\Omega, \boldsymbol\Theta)\)` and `\((\mathbf{M}, \mathbf{S})\)`.
Solving in `\(\boldsymbol\Omega\)` leads to

`$$\hat{\boldsymbol\Omega} = \arg\max_{\boldsymbol\Omega} \frac{n}{2} \left(\log | \boldsymbol\Omega | - \text{trace}(\hat{\boldsymbol\Sigma} \boldsymbol\Omega)\right) - \lambda \|\boldsymbol\Omega\|_{1, \text{off}}: \quad \text{a graphical-Lasso problem}$$`

with `\(\hat{\boldsymbol\Sigma} = n^{-1}(\mathbf{M}^\top \mathbf{M} + \mathrm{diag}(\bar{\mathbf{S}}^2))\)`.

---

# Network inference on the oaks data set

- Infer a collection of networks indexed by the sparsity level, accounting for the tree effect.

`$$\text{EBIC}_\gamma(\hat{\boldsymbol\Omega}_\lambda) = -2 \textrm{loglik} (\hat{\boldsymbol\Omega}) + \log(n) (|\mathcal{E}_\lambda| + p d) + \gamma \log {p(p+1)/2 \choose |\mathcal{E}_\lambda|},$$`

.pull-left[
where `\(\mathcal{E}_\lambda\)` is the set of edges selected at `\(\lambda\)`. <br />

```r
networks_oaks_tree <- PLNnetwork(
  Abundance ~ 0 + tree + offset(log(Offset)),
  data = oaks
)
```
]

.pull-right[
<!-- -->
]

---

# PLNnetwork: field access

Let us plot the estimated correlation matrix, after regularization of its inverse, and the corresponding network of partial correlations.

.pull-left[
<!-- -->
]

.pull-right[
<!-- -->
]

---

# PLNnetwork: stability selection

An alternative to model selection criteria is stability selection - or StARS in the context of networks.

1. Infer `\(B\)` networks `\({\boldsymbol\Omega}^{(b, \lambda)}\)` on subsamples of size `\(m\)` for varying `\(\lambda\)`.
2. The frequency of inclusion of each edge `\(e = i\sim j\)` is estimated by

`$$p_e^\lambda = \# \{b: \Omega^{(b, \lambda)}_{ij} \neq 0\}/B$$`

3. The variance of inclusion of edge `\(e\)` is `\(v_e^\lambda = p_e^\lambda (1 - p_e^\lambda)\)`.
4. The network stability is `\(\textrm{stab}(\lambda) = 1 - 2\bar{v}^\lambda\)` where `\(\bar{v}^\lambda\)` is the average of the `\(v_e^\lambda\)`.

<br />

`\(\rightsquigarrow\)` StARS<sup>1</sup> selects the smallest `\(\lambda\)` (densest network) for which `\(\textrm{stab}(\lambda) \geq 1 - 2\beta\)`

.footnote[Liu, Roeder, and Wasserman [LRW10] suggest using `\(2\beta = 0.05\)` and `\(m = \lfloor 10 \sqrt{n}\rfloor\)` based on theoretical results.]

---

# StARS

In `getBestModel`, when "StARS" is requested, stability selection is performed if needed:

```r
stability_selection(networks_oaks_tree)
```

.pull-left[
<!-- -->
]

.pull-right[
<!-- -->
]

---

# Conclusion

## Summary

- PLN = generic model for multivariate count data analysis
- Flexible modeling of the covariance structure, allows for covariates
- Efficient V-EM algorithm

## Extensions

- Other variants
  - zero inflation (data with _a lot_ of zeros)
  - covariance structures (spatial, time series, ...)
  - Variable selection ($\ell_1$-penalty on the regression coefficients)
- Other optimization approaches
  - large-scale problems: ELBO + SG variants/GPU
  - Composite likelihood, starting from the variational solution
  - .important[log-likelihood] (MCMC, stochastic gradient)
  - Pave the way for confidence intervals and tests for regular PLN

---

# References

Aitchison, J. and C. Ho (1989). "The multivariate Poisson-log normal distribution". In: _Biometrika_ 76.4, pp. 643-653.

Chiquet, J., M. Mariadassou, and S. Robin (2018). "Variational inference for probabilistic Poisson PCA". In: _The Annals of Applied Statistics_ 12, pp. 2674-2698. URL: [http://dx.doi.org/10.1214/18-AOAS1177](http://dx.doi.org/10.1214/18-AOAS1177).

Chiquet, J., M. Mariadassou, and S. Robin (2019). "Variational inference for sparse network reconstruction from count data".
In: _Proceedings of the 36th International Conference on Machine Learning (ICML 2019)_.

Inouye, D. I., E. Yang, G. I. Allen, et al. (2017). "A review of multivariate distributions for count data derived from the Poisson distribution". In: _Wiley Interdisciplinary Reviews: Computational Statistics_ 9.3.

Jakuschkin, B., V. Fievet, L. Schwaller, et al. (2016). "Deciphering the pathobiome: intra- and interkingdom interactions involving the pathogen Erysiphe alphitoides". In: _Microbial Ecology_ 72.4, pp. 870-880.

Johnson, S. G. (2011). _The NLopt nonlinear-optimization package_. URL: [http://ab-initio.mit.edu/nlopt](http://ab-initio.mit.edu/nlopt).

Karlis, D. (2005). "EM algorithm for mixed Poisson and other discrete distributions". In: _Astin Bulletin_ 35.01, pp. 3-24.

Liu, H., K. Roeder, and L. Wasserman (2010). "Stability Approach to Regularization Selection (StARS) for High Dimensional Graphical Models". In: _Proceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 2_. USA, pp. 1432-1440.

Mardia, K., J. Kent, and J. Bibby (1979). _Multivariate Analysis_. Academic Press.

Svanberg, K. (2002). "A class of globally convergent optimization methods based on conservative convex separable approximations". In: _SIAM Journal on Optimization_ 12.2, pp. 555-573.

Wainwright, M. J. and M. I. Jordan (2008). "Graphical Models, Exponential Families, and Variational Inference". In: _Found. Trends Mach. Learn._ 1.1-2, pp. 1-305.

Westling, T. and T. H. McCormick (2015). _Beyond prediction: A framework for inference with variational approximations in mixture models_. arXiv: 1510.08151 [stat.ME].