class: center, middle, inverse, title-slide .title[ # Poisson lognormal models for count data ] .subtitle[ ## Focus on some probabilistic factor models: Poisson LDA and Poisson PCA ] .author[ ### J. Chiquet, M. Mariadassou, S. Robin, B. Batardière + others
MIA Paris-Saclay, AgroParisTech, INRAE, Sorbonne Université
Last update March 29, 2023
] .date[ ###
https://pln-team.github.io/PLNmodels
]

---
class: inverse, middle

# Outline

1. .large[**Framework** of multivariate Poisson lognormal models] <br/>
2. .large[**PLN estimation** with variational inference] <br/>
3. .large[**Focus** on factor analysis with PLN (LDA, PCA)] <br/>
4. .large[**Illustration** on a scRNA dataset]

---
class: inverse, center, middle

# Multivariate Poisson lognormal models <br/> .small[Motivations, Framework]

<!-- STATISTICAL MODEL -->

---
# Generic form of data sets

Routinely gathered in ecology/microbiology/genomics

### Data tables

- .important[Abundances]: read counts of species/transcripts `\(j\)` in sample `\(i\)`
- .important[Covariates]: value of environmental variable `\(k\)` in sample `\(i\)`
- .important[Offsets]: sampling effort for species/transcripts `\(j\)` in sample `\(i\)`

### Need frameworks to model _dependencies between counts_

- understand **environmental effects** <br/> `\(\rightsquigarrow\)` explanatory models (multivariate regression, classification)
- exhibit **patterns of diversity** <br/> `\(\rightsquigarrow\)` summarize the information (clustering, dimension reduction)
- understand **between-species interactions** <br/> `\(\rightsquigarrow\)` 'network' inference (variable/covariance selection)
- correct for technical and **confounding effects** <br/> `\(\rightsquigarrow\)` account for covariates and sampling effort

---
# Illustration in genomics

## scRNA data set

The dataset `scRNA` contains the counts of the 500 most varying transcripts in mixtures of 5 cell lines in human liver (obtained with the standard 10x scRNAseq Chromium protocol). We subsample 500 random cells and keep the 200 most varying genes.

```r
data(scRNA); set.seed(1234)
train_set <- sample.int(nrow(scRNA), 500)
test_set  <- setdiff(1:nrow(scRNA), train_set)
scRNA_train <- scRNA[train_set, ]
scRNA_test  <- scRNA[test_set, ]
scRNA_train$counts <- scRNA_train$counts[, 1:200]
scRNA_test$counts  <- scRNA_test$counts[, 1:200]
```

### Covariates

- `cell_line`: the cell line of the current row (among 5)
- `total_counts`: total number of reads for that cell

---
# Table of Counts

.pull-left[
Matrix view (log-transform + total-counts normalization)
![log-scaled counts](slides_files/figure-html/glance counts-1.png)
]
.pull-right[
Histogram (log-transform + total-counts normalization)
![](slides_files/figure-html/glimpse Abundance-1.png)
]

---
# Models for multivariate count data

### If we were in a Gaussian world...

The .important[general linear model] [MKB79] would be appropriate! For each sample `\(i = 1,\dots,n\)`,

`$$\underbrace{\mathbf{Y}_i}_{\text{abundances}} = \underbrace{\mathbf{x}_i^\top \mathbf{B}}_{\text{covariates}} + \underbrace{\mathbf{o}_i}_{\text{sampling effort}} + \boldsymbol\varepsilon_i, \quad \boldsymbol\varepsilon_i \sim \mathcal{N}(\mathbf{0}_p, \underbrace{\boldsymbol\Sigma}_{\text{between-species dependencies}})$$`

null covariance `\(\Leftrightarrow\)` independence `\(\rightsquigarrow\)` uncorrelated species/transcripts do not interact

.content-box-red[This model gives birth to Principal Component Analysis, Discriminant Analysis, Gaussian Graphical Models, Gaussian Mixture models and many others `\(\dots\)`]

### With count data...
There is no generic model for multivariate counts

- Data transformation (log, `\(\sqrt{}\)`): quick and dirty
- Non-Gaussian multivariate distributions [Ino+17]: do not scale to data dimension yet
- .important[Latent variable models]: interactions occur in a latent (unobserved) layer

---
# The Poisson Lognormal model (PLN)

The PLN model [AH89] is a .important[multivariate generalized linear model], where

- the counts `\(\mathbf{Y}_i\)` are the response variables
- the main effect is due to a linear combination of the covariates `\(\mathbf{x}_i\)`
- a vector of offsets `\(\mathbf{o}_i\)` can be specified for each sample.

.content-box-red[
$$ \mathbf{Y}_i | \mathbf{Z}_i \sim \mathcal{P}\left(\exp\{\mathbf{Z}_i\}\right), \qquad \mathbf{Z}_i \sim \mathcal{N}({\mathbf{o}_i + \mathbf{x}_i^\top\mathbf{B}},\boldsymbol\Sigma) $$
]

.pull-left[The unknown parameters are
- `\(\mathbf{B}\)`, the regression parameters
- `\(\boldsymbol\Sigma\)`, the variance-covariance matrix
]
.pull-right[Stacking all individuals together,
- `\(\mathbf{Y}\)` is the `\(n\times p\)` matrix of counts
- `\(\mathbf{X}\)` is the `\(n\times d\)` design matrix
- `\(\mathbf{O}\)` is the `\(n\times p\)` matrix of offsets
]

### Properties: .small[.important[over-dispersion, arbitrary-signed covariances]]

- mean: `\(\mathbb{E}(Y_{ij}) = \exp \left( o_{ij} + \mathbf{x}_i^\top {\mathbf{B}}_{\cdot j} + \sigma_{jj}/2\right) > 0\)`
- variance: `\(\mathbb{V}(Y_{ij}) = \mathbb{E}(Y_{ij}) + \mathbb{E}(Y_{ij})^2 \left( e^{\sigma_{jj}} - 1 \right) > \mathbb{E}(Y_{ij})\)`
- covariance: `\(\mathrm{Cov}(Y_{ij}, Y_{ik}) = \mathbb{E}(Y_{ij}) \mathbb{E}(Y_{ik}) \left( e^{\sigma_{jk}} - 1 \right)\)`

---
class: inverse, center, middle

# Variational inference for standard PLN<br/> .small[Optimisation]

<!-- VARIATIONAL INFERENCE -->

---
# PLN Inference: general ingredients

Estimate `\(\theta = (\mathbf{B}, \boldsymbol\Sigma)\)` and predict the `\(\mathbf{Z}_i\)`, while the marginal likelihood of the model is

`$$p_\theta(\mathbf{Y}_i) = \int_{\mathbb{R}^p} \prod_{j=1}^p p_\theta(Y_{ij} | Z_{ij}) \, p_\theta(\mathbf{Z}_i) \mathrm{d}\mathbf{Z}_i$$`

### Expectation-Maximization

With `\(\mathcal{H}(p) = -\mathbb{E}_p(\log(p))\)` the entropy of `\(p\)`,

`$$\log p_\theta(\mathbf{Y}) = \mathbb{E}_{p_\theta(\mathbf{Z}\,|\,\mathbf{Y})} [\log p_\theta(\mathbf{Y}, \mathbf{Z})] + \mathcal{H}[p_\theta(\mathbf{Z}\,|\,\mathbf{Y})]$$`

EM requires evaluating (some moments of) `\(p_\theta(\mathbf{Z} \,|\, \mathbf{Y})\)`, but there is no closed form!

### Variational approximation [WJ08]

Use a proxy `\(q_\psi\)` of `\(p_\theta(\mathbf{Z}\,|\,\mathbf{Y})\)` minimizing a divergence in a class `\(\mathcal{Q}\)` .small[(e.g., the Kullback-Leibler divergence)]

`$$q_\psi(\mathbf{Z})^\star = \arg\min_{q\in\mathcal{Q}} D\left(q(\mathbf{Z}), p(\mathbf{Z} \,|\, \mathbf{Y})\right), \quad \text{e.g. } D(\cdot,\cdot) = KL(\cdot,\cdot) = \mathbb{E}_{q}\left[\log \frac{q(\mathbf{Z})}{p(\mathbf{Z}\,|\,\mathbf{Y})}\right].$$`
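To make the intractability concrete, here is a toy Monte Carlo estimate of `\(p_\theta(\mathbf{Y}_i)\)` in R. All sizes and parameter values below are illustrative assumptions (no offsets or covariates): the estimator is easy to write but its variance blows up as `\(p\)` grows, hence the variational proxy.

```r
## Toy Monte Carlo estimate of the marginal likelihood (illustrative values)
library(MASS)
set.seed(1)
p  <- 3                            # toy number of species
mu <- rep(0, p); Sigma <- diag(p)  # assumed latent Gaussian parameters
Y_i <- c(2, 0, 5)                  # one observed count vector
Z <- mvrnorm(1e4, mu, Sigma)       # draws Z_i ~ N(mu, Sigma)
## average the conditional Poisson likelihood over the latent draws
loglik <- rowSums(dpois(matrix(Y_i, 1e4, p, byrow = TRUE), exp(Z), log = TRUE))
mean(exp(loglik))                  # MC estimate of p_theta(Y_i)
```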
---
# Inference: specific ingredients

Consider `\(\mathcal{Q}\)` the class of diagonal multivariate Gaussian distributions:

`$$\Big\{q: \, q(\mathbf{Z}) = \prod_i q_i(\mathbf{Z}_i), \, q_i(\mathbf{Z}_i) = \mathcal{N}\left(\mathbf{Z}_i; \mathbf{m}_i, \mathrm{diag}(\mathbf{s}_i \circ \mathbf{s}_i)\right), \boldsymbol\psi_i = (\mathbf{m}_i, \mathbf{s}_i) \in\mathbb{R}^p\times\mathbb{R}^p \Big\}$$`

and maximize the ELBO (Evidence Lower BOund)

`$$\begin{aligned}J(\theta, \psi) & = \log p_\theta(\mathbf{Y}) - KL[q_\psi (\mathbf{Z}) || p_\theta(\mathbf{Z} | \mathbf{Y})] \\ & = \mathbb{E}_{\psi} [\log p_\theta(\mathbf{Y}, \mathbf{Z})] + \mathcal{H}[q_\psi(\mathbf{Z})] \\ & = \frac{1}{n} \sum_{i = 1}^n J_i(\theta, \psi_i),\end{aligned}$$`

where, letting `\(\mathbf{A}_i = \mathbb{E}_{q_i}[\exp(Z_i)] = \exp\left( \mathbf{o}_i + \mathbf{m}_i + \frac{1}{2}\mathbf{s}^2_i\right)\)`, we have

`$$\begin{aligned} J_i(\theta, \psi_i) = &\mathbf{Y}_i^\intercal(\mathbf{o}_i + \mathbf{m}_i) - \left(\mathbf{A}_i - \frac{1}{2}\log(\mathbf{s}^2_i)\right) ^\intercal \mathbf{1}_p + \frac{1}{2} \log|{\boldsymbol\Omega}| \\ & - \frac{1}{2}(\mathbf{m}_i - \mathbf{B}^\top\mathbf{x}_i)^\intercal \boldsymbol{\Omega} (\mathbf{m}_i - \mathbf{B}^\top\mathbf{x}_i) - \frac{1}{2} \mathrm{diag}(\boldsymbol\Omega)^\intercal\mathbf{s}^2_i + \mathrm{cst} \end{aligned}$$`

---
# Resulting Variational EM

.important[Alternate] until convergence between

- VE step: optimize `\(\boldsymbol{\psi}\)` (can be written individually)
`$$\psi_i^{(h)} = \arg \max_{\psi_i} J_{i}(\theta^{(h)}, \psi_i) \left( = \arg\min_{q_i} KL[q_i(\mathbf{Z}_i) \,||\, p_{\theta^{(h)}}(\mathbf{Z}_i\,|\,\mathbf{Y}_i)] \right)$$`
- M step: optimize `\(\theta\)`
`$$\theta^{(h)} = \arg\max_\theta \frac{1}{n}\sum_{i=1}^{n}J_{i}(\theta, \psi_i^{(h)})$$`

We end up with an `\(M\)`-estimator:

`$$\hat{\theta}^{\text{ve}} = \arg\max_{\theta} \left( \frac{1}{n}\sum_{i=1}^{n} \sup_{\psi_i} J_i(\theta, \psi_i) \right) = \arg\max_{\theta} \underbrace{\left(\frac{1}{n}\sum_{i=1}^{n} \bar{J}_i(\theta) \right)}_{\bar{J}_n(\theta)}$$`

`\(\hat{\theta}^{\text{ve}}\)` is asymptotically unbiased. .important[Variance can be estimated with bootstrap/jackknife.]

---
# Optimization of simple PLN models

### Property of the objective function

The ELBO `\(J(\theta, \psi)\)` is bi-concave, i.e.

- concave wrt `\(\psi = (\mathbf{M}, \mathbf{S})\)` for given `\(\theta\)`
- concave wrt `\(\theta = (\boldsymbol\Sigma, \mathbf{B})\)` for given `\(\psi\)`

but .important[not jointly concave] in general.

### M-step: analytical

`$$\hat{{\mathbf{B}}} = \left(\mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{M}, \quad \hat{{\boldsymbol\Sigma}} = \frac{1}{n} \left(\mathbf{M}-\mathbf{X}\hat{{\mathbf{B}}}\right)^\top \left(\mathbf{M}-\mathbf{X}\hat{\mathbf{B}}\right) + \frac{1}{n} \mathrm{diag}(\mathbf{1}^\intercal\mathbf{S}^2)$$`

### VE-step: gradient ascent

`$$\frac{\partial J(\psi)}{\partial \mathbf{M}} = \left(\mathbf{Y} - \mathbf{A} - (\mathbf{M} - \mathbf{X}{\mathbf{B}}) \mathbf{\Omega}\right), \qquad \frac{\partial J(\psi)}{\partial \mathbf{S}} = \frac{1}{\mathbf{S}} - \mathbf{S} \circ \mathbf{A} - \mathbf{S} \, \mathrm{D}_{\boldsymbol\Omega}.$$`

`\(\rightsquigarrow\)` Same routine for other PLN variants.
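As a sanity check, the two key quantities above fit in a few lines of R. This is a hedged sketch with illustrative function names (not the `PLNmodels` internals); `Omega` stands for `\(\boldsymbol\Sigma^{-1}\)` and constants are dropped from `\(J_i\)`.

```r
## Hedged sketch: per-sample ELBO J_i and analytical M-step (not PLNmodels code)
elbo_i <- function(Y_i, o_i, x_i, B, Omega, m_i, s2_i) {
  A_i  <- exp(o_i + m_i + s2_i / 2)            # A_i = E_q[exp(Z_i)]
  mu_i <- drop(crossprod(B, x_i))              # B^T x_i
  sum(Y_i * (o_i + m_i)) - sum(A_i) + 0.5 * sum(log(s2_i)) +
    0.5 * as.numeric(determinant(Omega)$modulus) -        # (1/2) log|Omega|
    0.5 * drop(t(m_i - mu_i) %*% Omega %*% (m_i - mu_i)) -
    0.5 * sum(diag(Omega) * s2_i)              # constants dropped
}

m_step <- function(X, M, S2) {
  B_hat <- solve(crossprod(X), crossprod(X, M))           # (X^T X)^{-1} X^T M
  R     <- M - X %*% B_hat
  Sigma_hat <- crossprod(R) / nrow(X) + diag(colSums(S2)) / nrow(X)
  list(B = B_hat, Sigma = Sigma_hat)
}
```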
---
# Implementations

#### Medium-scale problems (R/C++ package)

- **algorithm**: conservative convex separable approximations [Sva02]
- **implementation**: `NLopt` nonlinear-optimization library [Joh11] <br/>
`\(\rightsquigarrow\)` Up to thousands of sites ( `\(n \approx 1000s\)` ), hundreds of species ( `\(p\approx 100s\)` )

#### Large-scale problems (Python/PyTorch module)

- **algorithm**: `Rprop` (gradient sign + adaptive variable-specific update) [RB93]
- **implementation**: `torch` with GPU auto-differentiation [FL22; Pas+17] <br/>
`\(\rightsquigarrow\)` Up to `\(n \approx 100,000\)` and `\(p\approx 10,000s\)`

<div class="figure" style="text-align: center">
<img src="figs/final_n=10000,p=2000.png" alt="n = 10,000, p = 2,000, d = 2 (running time: 1 min 40s)" width="30%" />
<p class="caption">n = 10,000, p = 2,000, d = 2 (running time: 1 min 40s)</p>
</div>

---
# Availability

### Help and documentation

- GitHub group <https://github.com/pln-team>
- PLNmodels website <https://pln-team.github.io/PLNmodels>

### R/C++ package `PLNmodels`

Last stable release on CRAN, development version available on GitHub.

```r
install.packages("PLNmodels")
remotes::install_github("PLN-team/PLNmodels@dev")
```

```r
library(PLNmodels)
packageVersion("PLNmodels")
```

```
## [1] '1.0.1'
```

### Python module `pyPLNmodels`

A PyTorch implementation is available on PyPI: [https://pypi.org/project/pyPLNmodels/](https://pypi.org/project/pyPLNmodels/)

---
# PLN with offsets and covariates

- Cell-line effect is captured in the regression coefficients (groupwise or common mean)
- Spurious effects regarding interactions between genes (full or diagonal covariance)

## Offset: .small[modeling sampling effort]

The predefined offset uses the total sum of reads.

```r
M01_scRNA <- PLN(counts ~ 1 + offset(log(total_counts)), scRNA_train)
M02_scRNA <- PLN(counts ~ 1 + offset(log(total_counts)), scRNA_train,
                 control = PLN_param(covariance = "diagonal"))
```

## Covariates: .small[cell-line effect ('ANOVA'-like)]

The `cell_line` is a natural candidate for explaining a large part of the variance.
```r
M11_scRNA <- PLN(counts ~ 0 + cell_line + offset(log(total_counts)), scRNA_train)
M12_scRNA <- PLN(counts ~ 0 + cell_line + offset(log(total_counts)), scRNA_train,
                 control = PLN_param(covariance = "diagonal"))
```

---
# PLN with offsets and covariates (2)

There is a clear gain in introducing the `cell_line` covariate in the model:

```r
rbind(M01 = M01_scRNA$criteria, M02 = M02_scRNA$criteria,
      M11 = M11_scRNA$criteria, M12 = M12_scRNA$criteria) %>%
  knitr::kable(format = "html")
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> </th>
   <th style="text-align:right;"> nb_param </th>
   <th style="text-align:right;"> loglik </th>
   <th style="text-align:right;"> BIC </th>
   <th style="text-align:right;"> ICL </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> M01 </td>
   <td style="text-align:right;"> 20300 </td>
   <td style="text-align:right;"> -203318.4 </td>
   <td style="text-align:right;"> -266396.7 </td>
   <td style="text-align:right;"> -294420.7 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> M02 </td>
   <td style="text-align:right;"> 400 </td>
   <td style="text-align:right;"> -258421.9 </td>
   <td style="text-align:right;"> -259664.8 </td>
   <td style="text-align:right;"> -342792.9 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> M11 </td>
   <td style="text-align:right;"> 21100 </td>
   <td style="text-align:right;"> -199111.2 </td>
   <td style="text-align:right;"> -264675.3 </td>
   <td style="text-align:right;"> -289661.4 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> M12 </td>
   <td style="text-align:right;"> 1200 </td>
   <td style="text-align:right;"> -216457.5 </td>
   <td style="text-align:right;"> -220186.3 </td>
   <td style="text-align:right;"> -275057.6 </td>
  </tr>
</tbody>
</table>

---
# PLN with offsets and covariates (3)

Looking at the coefficients `\(\mathbf{B}\)` associated with `cell_line` brings additional insights:

<img src="slides_files/figure-html/scRNA matrix plot-1.png" style="display: block; margin: auto;" />

---
# Simple torch example in `R`

```r
data("oaks")
system.time(myPLN_torch <- PLN(Abundance ~ 1 + offset(log(Offset)), data = oaks,
                               control = PLN_param(backend = "torch", trace = 0)))
```

```
##    user  system elapsed 
##   3.378   0.438   3.067
```

```r
system.time(myPLN_nlopt <- PLN(Abundance ~ 1 + offset(log(Offset)), data = oaks,
                               control = PLN_param(backend = "nlopt", trace = 0)))
```

```
##    user  system elapsed 
##   7.466   0.613   6.900
```

```r
myPLN_torch$loglik
```

```
## [1] -32224.15
```

```r
myPLN_nlopt$loglik
```

```
## [1] -32097.67
```

---
class: inverse, center, middle

# Probabilistic factor models in PLN<br/> .small[Poisson LDA and Poisson PCA]

---
# Natural extensions of PLN

### Various tasks of multivariate analysis

- .important[Dimension Reduction]: rank constraint on the covariance matrix `\(\boldsymbol\Sigma\)`.
`$$\mathbf{Z}_i \sim \mathcal{N}(\boldsymbol\mu, \boldsymbol\Sigma = \mathbf{C}\mathbf{C}^\top), \quad \mathbf{C} \in \mathcal{M}_{pk} \text{ with orthogonal columns}.$$`
- .important[Classification]: maximize separation between groups with means
`$$\mathbf{Z}_i \sim \mathcal{N}(\sum_k {\boldsymbol\mu}_k \mathbf{1}_{\{i\in k\}}, \boldsymbol\Sigma), \quad \text{for known memberships}.$$`
- .important[Clustering]: mixture model in the latent space
`$$\mathbf{Z}_i \mid i \in k \sim \mathcal{N}(\boldsymbol\mu_k, \boldsymbol\Sigma_k), \quad \text{for unknown memberships}.$$`
- .important[Network inference]: sparsity constraint on the inverse covariance.
`$$\mathbf{Z}_i \sim \mathcal{N}(\boldsymbol\mu, \boldsymbol\Sigma = \boldsymbol\Omega^{-1}), \quad \|\boldsymbol\Omega \|_1 < c.$$`
- .important[Variable selection]: sparsity constraint on the regression coefficients
`$$\mathbf{Z}_i \sim \mathcal{N}(\mathbf{x}_i^\top\mathbf{B}, \boldsymbol\Sigma), \quad \|\mathbf{B} \|_1 < c.$$`

---
# Natural extensions of PLN

### Tasks seen in Factor Analysis

- .important[Dimension Reduction]: rank constraint on the covariance matrix `\(\boldsymbol\Sigma\)`. .important[PCA]
`$$\mathbf{Z}_i \sim \mathcal{N}(\boldsymbol\mu, \boldsymbol\Sigma = \mathbf{C}\mathbf{C}^\top), \quad \mathbf{C} \in \mathcal{M}_{pk} \text{ with orthogonal columns}.$$`
- .important[Classification]: maximize separation between groups with means .important[LDA]
`$$\mathbf{Z}_i \sim \mathcal{N}(\sum_k {\boldsymbol\mu}_k \mathbf{1}_{\{i\in k\}}, \boldsymbol\Sigma), \quad \text{for known memberships}.$$`

---
# Background: Gaussian LDA

### Model

Assume the samples are distributed in `\(K\)` groups and denote by

- `\(\mathbf{G}\)` the group membership matrix
- `\(\mathbf{U} = [{\boldsymbol\mu}_1, \dots, {\boldsymbol\mu}_K]\)` the matrix of group-specific means

The model is

`$$\mathbf{Z}_i \sim \mathcal{N}(\mathbf{G}_i^\top \mathbf{U}, {\boldsymbol\Sigma})$$`

`\(\rightsquigarrow\)` Aim of LDA: find the linear combination(s) `\(\mathbf{Z} u\)` maximizing separation between groups

### Solution

Find the first eigenvectors of `\(\mathbf{\Sigma}_w^{-1} \mathbf{\Sigma}_b\)` where

- `\(\mathbf{\Sigma}_w\)` is the _within_-group variance matrix, i.e. the unbiased estimate of `\({\boldsymbol\Sigma}\)`
- `\(\mathbf{\Sigma}_b\)` is the _between_-group variance matrix, estimated from `\(\mathbf{U}\)`

---
# Poisson lognormal LDA (1)

Similar to standard PLN with

- `\(\mathbf{X} \rightarrow (\mathbf{X}, \mathbf{G})\)`
- `\(\mathbf{B} \rightarrow (\mathbf{B}, \mathbf{U})\)`

### Inference

- Use .important[variational inference] to estimate `\((\mathbf{B}, \mathbf{U})\)`, `\(\mathbf{\Sigma}\)` and `\(\mathbf{Z}_i\)`
- Compute `\(\hat{\mathbf{\Sigma}}_b\)` as
`$$\hat{\mathbf{\Sigma}}_b = \frac{1}{K-1} \sum_k n_k (\hat{{\boldsymbol\mu}}_k - \hat{{\boldsymbol\mu}}_\bullet) (\hat{{\boldsymbol\mu}}_k - \hat{{\boldsymbol\mu}}_\bullet)^\intercal$$`
- Compute the first `\(K-1\)` eigenvectors of `\(\hat{{\boldsymbol\Sigma}}^{-1} \hat{\mathbf{\Sigma}}_b = \mathbf{P}^\top \Lambda \mathbf{P}\)`

---
# Poisson lognormal LDA (2)

### Graphical representation

Mimic Gaussian LDA:

- Center the estimated latent positions `\(\tilde{\mathbf{Z}}\)`
- Compute the estimated coordinates along the discriminant axes
`$$\tilde{\mathbf{Z}}^{LDA} = \tilde{\mathbf{Z}} \mathbf{P} \Lambda^{1/2}$$`

### Prediction

For each group `\(k\)`

- Assume that the new sample `\(\mathbf{Y}_{\text{new}}\)` comes from group `\(k\)`
- Compute the (variational) _likelihood_ `\(p_k = \mathbb{P}(\mathbf{Y}_{\text{new}} | \hat{{\boldsymbol\Sigma}}, \hat{\mathbf{\Sigma}}_b, \hat{{\boldsymbol\mu}}_k)\)`
- Compute the posterior probability `\(\pi_k \propto \frac{n_k p_k}{n}\)`

`\(\rightsquigarrow\)` Assign to the group with highest `\(\pi_k\)`

---
# Discriminant Analysis (scRNA, 1)

Use the `cell_line` variable for grouping (`grouping` is a factor giving the group of each sample)

```r
myLDA_cell_line <- PLNLDA(counts ~ 1 + offset(log(total_counts)),
                          grouping = cell_line, data = scRNA_train)
```

.pull-left[
![](slides_files/figure-html/plot-lda-oaks1-1.png)
]
.pull-right[
![](slides_files/figure-html/plot-lda-oaks2-1.png)
]

---
# Discriminant Analysis (scRNA, 2)

Consider now a diagonal covariance.
```r
myLDA_cell_line_diag <- PLNLDA(counts ~ 1 + offset(log(total_counts)),
                               grouping = cell_line, data = scRNA_train,
                               control = PLN_param(covariance = "diagonal"))
```

.pull-left[
![](slides_files/figure-html/plot-lda-diag-oaks1-1.png)
]
.pull-right[
![](slides_files/figure-html/plot-lda-diag-oaks2-1.png)
]

---
# Discriminant Analysis (scRNA, 3)

We can predict the cell line of new data: let us try either diagonal or fully parametrized covariance.

```r
pred_cell_line      <- predict(myLDA_cell_line     , newdata = scRNA_test, type = "response")
pred_cell_line_diag <- predict(myLDA_cell_line_diag, newdata = scRNA_test, type = "response")
```

The ARI on the test set is impressive .footnote[The problem should be easy, though...]

```r
aricode::ARI(pred_cell_line, scRNA_test$cell_line)
```

```
## [1] 0.9770576
```

```r
aricode::ARI(pred_cell_line_diag, scRNA_test$cell_line)
```

```
## [1] 0.9683799
```

---
# Discriminant Analysis (scRNA, 4)

Let us explore the discriminant groups:

```r
heatmap(exp(myLDA_cell_line$group_means))
```

<img src="slides_files/figure-html/LDA-heatmap-factors-1.png" style="display: block; margin: auto;" />

Indeed, some groups of genes characterize the cell lines well.

---
# Poisson Lognormal PCA

### Model

`$$\begin{array}{rcl} \mathbf{Z}_i & \sim^\text{iid} \mathcal{N}_p({\boldsymbol 0}_p, {\boldsymbol\Sigma}), & \text{rank}({\boldsymbol\Sigma}) = k \ll p \\ \mathbf{Y}_i \,|\, \mathbf{Z}_i & \sim \mathcal{P}(\exp\{\mathbf{O}_i + \mathbf{X}_i \mathbf{B} + \mathbf{Z}_i\}) \end{array}$$`

Recall that `\(\text{rank}({\boldsymbol\Sigma}) = k \quad \Leftrightarrow \quad \exists\, \mathbf{C}\ (p \times k): \boldsymbol\Sigma = \mathbf{C} \mathbf{C}^\intercal\)`.

### Estimation

Variational inference

$$\text{maximize } J({\boldsymbol\beta}, {\boldsymbol\psi}) $$

`\(\rightsquigarrow\)` Still bi-concave in `\({\boldsymbol\beta} = (\mathbf{C}, \mathbf{B})\)` and `\({\boldsymbol\psi} = (\mathbf{M}, \mathbf{S})\)`

---
# Model selection and Visualization

### Number of components/rank `\(k\)` needs to be chosen.

`\(\log p_{\hat{\boldsymbol\beta}}(\mathbf{Y})\)` intractable: use the variational "likelihood" `\(J(\hat{\boldsymbol\beta}, \hat{\boldsymbol\psi})\)`

- BIC `\(\rightsquigarrow\)` `\(\text{vBIC}_k = J(\hat{\boldsymbol\beta}, \tilde{p}) - \frac12 p (d + k) \log(n)\)`
- ICL `\(\rightsquigarrow\)` `\(\text{vICL}_k = \text{vBIC}_k - \mathcal{H}(\tilde{p})\)`

$$ \hat{k} = \arg\max_k \text{vBIC}_k \qquad \text{or} \qquad \hat{k} = \arg\max_k \text{vICL}_k $$

### Visualization

- Gaussian PCA: optimal subspaces nested when `\(k\)` increases.
- PLN-pPCA: non-nested subspaces.

For the selected dimension `\(\hat{k}\)`:

- Compute the estimated latent positions `\(\mathbb{E}_q(\mathbf{Z}_i) = \mathbf{M} \hat{\mathbf{C}}^\top\)`
- Perform PCA on `\(\mathbf{M} \hat{\mathbf{C}}^\top\)`
- Display the result in any dimension `\(k \leq \hat{k}\)`

---
# PCA: Goodness of fit

.important[pPCA:] Cumulative sum of the eigenvalues = \% of variance preserved on the first `\(k\)` components.

### PLN-pPCA

Deviance-based criterion.

- Compute `\(\tilde{\mathbf{Z}}^{(k)} = \mathbf{O} + \mathbf{X} \hat{\mathbf{B}} + \mathbf{M}^{(k)} \left(\hat{\mathbf{C}}^{(k)}\right)^\top\)`
- Take `\(\lambda_{ij}^{(k)} = \exp\left(\tilde{Z}_{ij}^{(k)}\right)\)`
- Define `\(\lambda_{ij}^{\min} = \exp( \tilde{Z}_{ij}^0)\)` and `\(\lambda_{ij}^{\max} = Y_{ij}\)`
- Compute the Poisson log-likelihood `\(\ell_k = \log \mathbb{P}(\mathbf{Y}; \lambda^{(k)})\)`

### Pseudo R²

`$$R_k^2 = \frac{\ell_k - \ell_{\min}}{\ell_{\max} - \ell_{\min}}$$`
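A minimal R sketch of this pseudo-`\(R^2\)`, assuming the intensity matrices have been formed as above (`Y`, `Z_k` and `Z_0` are illustrative names for the counts, the rank-`\(k\)` fit `\(\tilde{\mathbf{Z}}^{(k)}\)` and the null fit `\(\tilde{\mathbf{Z}}^0\)`):

```r
## Hedged sketch of the deviance-based pseudo-R^2 (illustrative, not PLNmodels code)
pseudo_R2 <- function(Y, Z_k, Z_0) {
  poisson_loglik <- function(lambda) sum(dpois(Y, lambda, log = TRUE))
  l_k   <- poisson_loglik(exp(Z_k))  # rank-k model intensities
  l_min <- poisson_loglik(exp(Z_0))  # null model (offsets + covariates only)
  l_max <- poisson_loglik(Y)         # saturated model: lambda_ij = Y_ij
  (l_k - l_min) / (l_max - l_min)
}
```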
---
# A PCA analysis of the scRNA data set (1)

```r
PCA_scRNA <- PLNPCA(counts ~ 1 + offset(log(total_counts)),
                    data = scRNA_train, ranks = c(1, 2, seq(5, 40, 5)))
```

### Model Selection

<img src="slides_files/figure-html/PCA offset vizu cell-line-1.png" style="display: block; margin: auto;" />

---
# A PCA analysis of the scRNA data set (2)

### Biplot

Individual and factor map (40 most contributing genes)

<small>
<img src="slides_files/figure-html/PCA offset vizu tree-1.png" style="display: block; margin: auto;" />
</small>

---
class: inverse, center, middle

# On-going works

---
# Optimisation, Algorithms

With .important[Bastien Batardière, Joon Kwon, Julien Stoehr]

## Exact Maximization

Direct likelihood optimization (SGD with importance sampling)

## Optimization

Optimization guarantees coupling adaptive SGD + variance reduction

---
# PLN LDA: Ongoing Extension

With .important[Nicolas Jouvin]

### Quadratic Discriminant Analysis

- Relax the assumption of common covariance between groups
- Test linear vs quadratic discriminant analysis

### Regularized Discriminant Analysis

- Ridge-like regularized covariance
- Sparse covariance
- Block-wise covariance

`\(\rightsquigarrow\)` better assess the structure of the sub-populations

---
# PLN PCA: Ongoing Extension

With .important[Nicolas Jouvin]

## Mixture of PLN PCA

- Assume several sub-populations in the latent space
- Each sub-population is represented by a different linear combination of the initial variables
- In the PLN setup, several variational approximations are possible

## Functional PCA

PhD of Barbara Bricout (S. Robin, S. Donnet)

Features are time points

---
# Temporal/Spatial dependencies

With .important[Stéphane Robin, Mahendra Mariadassou]

## 'In-row' dependency structure

Consider, e.g., a multivariate Gaussian auto-regressive process on the latent variable `\(\mathbf{Z}\)`

`$$\mathbf{Z}_1 \sim \mathcal{N}(0, {\boldsymbol\Sigma}), \quad \mathbf{Z}_t = \mathbf{A} \mathbf{Z}_{t-1} + {\boldsymbol\varepsilon}_t, \quad ({\boldsymbol\varepsilon}_t) \text{ iid } \sim \mathcal{N}(0, \Delta^{-1})$$`

The counts are described by `\((\mathbf{Y}_t) \in \mathbb{N}^p\)` with time `\(t\geq 0\)`

`$$(\mathbf{Y}_t) \text{ indep. given } \mathbf{Z}_t: \quad \mathbf{Y}_t \sim \mathcal{P}\left(\exp\left\{ \mathbf{x}_t^\top \mathbf{B} + \mathbf{Z}_t\right\}\right)$$`

`\(\rightsquigarrow\)` Camille Mondon's MSc

## Inference

Based on a Kalman filter in the latent layer + variational approximation

---
# A zero-inflated PLN

With .important[Bastien Batardière, Mahendra Mariadassou]

### Motivations

- account for a large amount of zeros, e.g.
with single-cell data,
- try to separate "true" zeros from "technical"/dropout zeros

### The Model

Use two latent vectors `\(\mathbf{W}_i\)` and `\(\mathbf{Z}_i\)` to model the excess of zeroes and the dependence structure

`$$\begin{aligned} \mathbf{Z}_i & \sim \mathcal{N}({\mathbf{o}_i + \mathbf{x}_i^\top\mathbf{B}},\boldsymbol\Sigma) \\ W_{ij} & \sim \mathcal{B}(\text{logit}^{-1}(\mathbf{x}_i^\top{\mathbf{B}}_j^0)) \\ Y_{ij} \, | \, W_{ij}, Z_{ij} & \sim W_{ij}\delta_0 + (1-W_{ij}) \mathcal{P}\left(\exp\{Z_{ij}\}\right) \end{aligned}$$`

The unknown parameters are

- `\(\mathbf{B}\)`, the regression parameters (from the PLN component)
- `\(\mathbf{B}^0\)`, the regression parameters (from the Bernoulli component)
- `\(\boldsymbol\Sigma\)`, the variance-covariance matrix

`\(\rightsquigarrow\)` ZI-PLN is a mixture of PLN and Bernoulli distributions with shared covariates.

---
# ZI-PLN Inference

### Variational approximation 1

`\begin{equation*} p(\mathbf{Z}_i, \mathbf{W}_i \,|\, \mathbf{Y}_i) \approx q_{\psi}(\mathbf{Z}_i, \mathbf{W}_i) \approx q_{\psi_1}(\mathbf{Z}_i) q_{\psi_2}(\mathbf{W}_i) \end{equation*}`

with

`\begin{equation*} q_{\psi_1}(\mathbf{Z}_i) = \mathcal{N}(\mathbf{Z}_i; \mathbf{m}_{i}, \mathrm{diag}(\mathbf{s}_{i} \circ \mathbf{s}_{i})), \qquad q_{\psi_2}(\mathbf{W}_i) = \otimes_{j=1}^p \mathcal{B}(W_{ij}, \pi_{ij}) \end{equation*}`

.important[Tested, works _partially_]: the variational approximation is too rough

### Variational approximation 2

`\begin{equation*} p(\mathbf{Z}_i, \mathbf{W}_i \,|\, \mathbf{Y}_i) \approx q_{\psi}(\mathbf{Z}_i, \mathbf{W}_i) \approx q_{\psi_1}(\mathbf{Z}_i|\mathbf{W}_i) q_{\psi_2}(\mathbf{W}_i) \end{equation*}`

`\begin{equation*} q_{\psi_1}(\mathbf{Z}_i | \mathbf{W}_i) = \text{mixture of Gaussians}, \quad q_{\psi_2}(\mathbf{W}_i) = \otimes_{j=1}^p \mathcal{B}(W_{ij}, \pi_{ij}) \end{equation*}`

---
# Conclusion

## Summary

- PLN = generic model for multivariate count data analysis
- Flexible modeling of the covariance structure, allows for covariates
- Efficient V-EM algorithm

## Advertisement

[https://computo.sfds.asso.fr](https://computo.sfds.asso.fr), a journal promoting reproducible research in ML and statistics.

---
# References

<small>

Aitchison, J. and C. Ho (1989). "The multivariate Poisson-log normal distribution". In: _Biometrika_ 76.4, pp. 643-653.

Chiquet, J., M. Mariadassou, and S. Robin (2018). "Variational inference for probabilistic Poisson PCA". In: _The Annals of Applied Statistics_ 12, pp. 2674-2698. URL: [http://dx.doi.org/10.1214/18-AOAS1177](http://dx.doi.org/10.1214/18-AOAS1177).

Chiquet, J., M. Mariadassou, and S. Robin (2019). "Variational inference for sparse network reconstruction from count data". In: _Proceedings of the 19th International Conference on Machine Learning (ICML 2019)_.

Chiquet, J., M. Mariadassou, and S. Robin (2021). "The Poisson-Lognormal Model as a Versatile Framework for the Joint Analysis of Species Abundances". In: _Frontiers in Ecology and Evolution_ 9. DOI: [10.3389/fevo.2021.588292](https://doi.org/10.3389%2Ffevo.2021.588292).

Facon, B., A. Hafsi, M. C. de la Masselière, et al. (2021). "Joint species distributions reveal the combined effects of host plants, abiotic factors and species competition as drivers of species abundances in fruit flies". In: _Ecology Letters_. DOI: [10.1111/ele.13825](https://doi.org/10.1111%2Fele.13825).

Falbel, D. and J. Luraschi (2022). _torch: Tensors and Neural Networks with 'GPU' Acceleration_. https://torch.mlverse.org/docs, https://github.com/mlverse/torch.

Inouye, D. I., E. Yang, G. I.
Allen, et al. (2017). "A review of multivariate distributions for count data derived from the Poisson distribution". In: _Wiley Interdisciplinary Reviews: Computational Statistics_ 9.3.

Johnson, S. G. (2011). _The NLopt nonlinear-optimization package_. URL: [http://ab-initio.mit.edu/nlopt](http://ab-initio.mit.edu/nlopt).

Lejal, E., J. Chiquet, J. Aubert, et al. (2021). "Temporal patterns in Ixodes ricinus microbial communities: an insight into tick-borne microbe interactions". In: _Microbiome_ 9.153. DOI: [10.1186/s40168-021-01051-8](https://doi.org/10.1186%2Fs40168-021-01051-8).

Mardia, K., J. Kent, and J. Bibby (1979). _Multivariate Analysis_. Academic Press.

Paszke, A., S. Gross, S. Chintala, et al. (2017). "Automatic differentiation in PyTorch".

Riedmiller, M. and H. Braun (1993). "A direct adaptive method for faster backpropagation learning: The RPROP algorithm". In: _IEEE International Conference on Neural Networks_. IEEE, pp. 586-591.

Svanberg, K. (2002). "A class of globally convergent optimization methods based on conservative convex separable approximations". In: _SIAM Journal on Optimization_ 12.2, pp. 555-573.

Wainwright, M. J. and M. I. Jordan (2008). "Graphical Models, Exponential Families, and Variational Inference". In: _Found. Trends Mach. Learn._ 1.1-2, pp. 1-305.

</small>