`vignettes/PLNLDA.Rmd`

This vignette illustrates the standard use of the `PLNLDA` function and the methods accompanying the R6 classes `PLNLDA` and `PLNLDAfit`.

The packages required for the analysis are **PLNmodels**
plus some others for data manipulation and representation:
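The setup chunk is missing from this rendering; a minimal sketch (the exact set of helper packages is an assumption, chosen to match the `%>%` pipes and `knitr::kable()` calls used below):

```r
library(PLNmodels) # model fitting: prepare_data(), PLNLDA()
library(ggplot2)   # graphics
library(dplyr)     # data manipulation, provides %>%
library(knitr)     # kable() for table rendering
```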

We illustrate our point with the trichoptera data set, a full description of which can be found in the corresponding vignette. Data preparation is also detailed in the specific vignette.

```
data(trichoptera)
trichoptera <- prepare_data(trichoptera$Abundance, trichoptera$Covariate)
```

The `trichoptera` data frame stores a matrix of counts (`trichoptera$Abundance`), a matrix of offsets (`trichoptera$Offset`) and some vectors of covariates (`trichoptera$Wind`, `trichoptera$Temperature`, etc.). In the following, we are particularly interested in the **discrete** covariate `trichoptera$Group`, which corresponds to disjoint time spans during which the catching took place. The correspondence between group labels and time spans is:

Label | Number of consecutive nights | Date
---|---|---
1 | 12 | June 59
2 | 5 | June 59
3 | 5 | June 59
4 | 4 | June 59
5 | 4 | July 59
6 | 1 | June 59
7 | 3 | June 60
8 | 4 | June 60
9 | 5 | June 60
10 | 4 | June 60
11 | 1 | June 60
12 | 1 | July 60

In the vein of Fisher (1936) and Rao (1948), we introduce a multi-class LDA model for multivariate count data which is a variant of the Poisson Lognormal model of Aitchison and Ho (1989) (see the PLN vignette as a reminder). Indeed, it can be viewed as a PLN model with a discrete group structure in the latent Gaussian space.

This PLN-LDA model can be written in a hierarchical framework where a sample of \(p\)-dimensional observation vectors \(\mathbf{Y}_i\) is related to some \(p\)-dimensional vectors of latent variables \(\mathbf{Z}_i\) and a discrete structure with \(K\) groups in the following way: \[\begin{equation} \begin{array}{rcll} \text{group structure } & & \boldsymbol{\mu}_i = \boldsymbol{\mu}_{g_i}, & g_i \in \{1, \dots, K\} \\ \text{latent space } & \mathbf{Z}_i \quad \text{indep.} & \mathbf{Z}_i \sim \mathcal{N}(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}) & \\ \text{observation space } & Y_{ij} \,|\, Z_{ij} \quad \text{indep.} & Y_{ij} \,|\, Z_{ij} \sim \mathcal{P}\left(\exp\{Z_{ij}\}\right) & \end{array} \end{equation}\] where \(g_i\) denotes the group sample \(i\) belongs to.

The parameters \(\boldsymbol{\mu}_k \in \mathbb{R}^p\) correspond to the group-specific main effects and the covariance matrix \(\boldsymbol{\Sigma}\) is shared among groups. An equivalent way of writing this model is the following: \[\begin{equation} \begin{array}{rcl} \text{latent space } & \mathbf{Z}_i \sim \mathcal{N}(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}), & \boldsymbol{\mu}_i = \mathbf{g}_i^\top \mathbf{M} \\ \text{observation space } & Y_{ij} \,|\, Z_{ij} \quad \text{indep.} & Y_{ij} \,|\, Z_{ij} \sim \mathcal{P}\left(\exp\{Z_{ij}\}\right), \end{array} \end{equation}\] where, with a slight abuse of notation, \(\mathbf{g}_i\) is a group-indicator vector of length \(K\) (\(g_{ik} = 1 \Leftrightarrow g_i = k\)) and \(\mathbf{M} = [\boldsymbol{\mu}_1^\top, \dots, \boldsymbol{\mu}_K^\top]^\top\) is a \(K \times p\) matrix collecting the group-specific main effects.

Just like PLN, PLN-LDA generalizes to a formulation close to a multivariate generalized linear model where the main effect is due to a linear combination of the discrete group structure, \(d\) covariates \(\mathbf{x}_i\) and a vector \(\mathbf{o}_i\) of \(p\) offsets in sample \(i\). The latent layer then reads \[\begin{equation} \mathbf{Z}_i \sim \mathcal{N}({\mathbf{o}_i + \mathbf{g}_i^\top \mathbf{M} + \mathbf{x}_i^\top\boldsymbol\Theta},\boldsymbol\Sigma) \end{equation}\] where \(\boldsymbol\Theta\) is a \(d\times p\) matrix of regression parameters.
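To make the generative mechanism concrete, here is a small base-R simulation from the hierarchical model above (a sketch with made-up dimensions and parameters, using a diagonal \(\boldsymbol{\Sigma}\) for simplicity):

```r
set.seed(42)
n <- 50; p <- 3; K <- 2                  # samples, species, groups
g <- sample(K, n, replace = TRUE)        # group memberships g_i
M <- matrix(c(1, 0, 2, 1, 0, 2), K, p)   # K x p group-specific means
## latent layer: Z_i ~ N(mu_{g_i}, Sigma), here Sigma = 0.25 * I
Z <- M[g, ] + matrix(rnorm(n * p, sd = 0.5), n, p)
## observation layer: Y_ij | Z_ij ~ Poisson(exp(Z_ij))
Y <- matrix(rpois(n * p, exp(Z)), n, p)
```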

Given:

- a new observation \(\mathbf{Y}\) with associated offset \(\mathbf{o}\) and covariates \(\mathbf{x}\)
- a model with estimated parameters \(\hat{\boldsymbol{\Sigma}}\), \(\hat{\boldsymbol{\Theta}}\), \(\hat{\mathbf{M}}\) and group counts \((n_1, \dots, n_K)\)

We can predict the observation's group using Bayes' rule as follows: for \(k \in \{1, \dots, K\}\), compute \[\begin{equation} \begin{aligned} f_k(\mathbf{Y}) & = p(\mathbf{Y} \,|\, \mathbf{g} = k, \mathbf{o}, \mathbf{x}, \hat{\boldsymbol{\Theta}}, \hat{\boldsymbol{\Sigma}}) \\ & = \boldsymbol{\Phi}_{PLN}(\mathbf{Y}; \mathbf{o} + \hat{\boldsymbol{\mu}}_k + \mathbf{x}^\top \hat{\boldsymbol{\Theta}}, \hat{\boldsymbol{\Sigma}}) \\ p_k & = \frac{n_k}{\sum_{k' = 1}^K n_{k'}} \end{aligned} \end{equation}\] where \(\boldsymbol{\Phi}_{PLN}(\bullet; \boldsymbol{\mu}, \boldsymbol{\Sigma})\) is the density function of a PLN distribution with parameters \((\boldsymbol{\mu}, \boldsymbol{\Sigma})\). \(f_k(\mathbf{Y})\) and \(p_k\) are respectively plug-in estimates of (i) the probability of observing counts \(\mathbf{Y}\) in a sample from group \(k\) and (ii) the probability that a sample originates from group \(k\).

The posterior probability \(\hat{\pi}_k(\mathbf{Y})\) that observation \(\mathbf{Y}\) belongs to group \(k\) and most likely group \(\hat{k}(\mathbf{Y})\) can thus be defined as \[\begin{equation} \begin{aligned} \hat{\pi}_k(\mathbf{Y}) & = \frac{p_k f_k(\mathbf{Y})}{\sum_{k' = 1}^K p_{k'} f_{k'}(\mathbf{Y})} \\ \hat{k}(\mathbf{Y}) & = \underset{k \in \{1, \dots, K\}}{\arg\max} \hat{\pi}_k(\mathbf{Y}) \end{aligned} \end{equation}\]
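In base R, the posterior probabilities and the predicted group follow from the log-densities by a numerically stable softmax (a sketch with hypothetical values for \(\log f_k(\mathbf{Y})\) and the group counts \(n_k\)):

```r
log_f <- c(-1562.6, -800.5, -1005.7)     # hypothetical log f_k(Y), K = 3 groups
n_k   <- c(12, 5, 4)                     # group counts, giving priors p_k
log_post <- log(n_k / sum(n_k)) + log_f  # log(p_k) + log(f_k(Y)), up to a constant
pi_hat <- exp(log_post - max(log_post))  # subtract the max to avoid underflow
pi_hat <- pi_hat / sum(pi_hat)           # posterior probabilities pi_hat_k
k_hat  <- which.max(pi_hat)              # most likely group
```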

Classification and prediction are the main objectives in (PLN-)LDA.
To reach this goal, we first need to estimate the model parameters.
Inference in PLN-LDA focuses on the group-specific main effects \(\mathbf{M}\), the regression parameters
\(\boldsymbol\Theta\) and the
covariance matrix \(\boldsymbol\Sigma\). Technically speaking,
we can treat \(\mathbf{g}_i\) as a
discrete covariate and estimate \([\mathbf{M},
\boldsymbol{\Theta}]\) using the same strategy as for the
standard **PLN** model. Briefly, we adopt a variational
strategy to approximate the log-likelihood function and optimize the
resulting variational surrogate of the log-likelihood with a
gradient-ascent-based approach. To this end, we rely on the CCSA
algorithm of Svanberg (2002) implemented in the C++ library NLopt
(Johnson 2011), to which the package links.

In the package, the PLN-LDA model is adjusted with the function `PLNLDA`, which we review in this section. This function adjusts the model and stores it in an object of class `PLNLDAfit`, which inherits from the class `PLNfit`, so we strongly recommend the reader to get comfortable with `PLN` and `PLNfit` before using `PLNLDA` (see the PLN vignette).

We start by adjusting the above model to the trichoptera data set. We use `Group`, the catching time spans, as a discrete structure and use the log of `Offset` as an offset to capture differences in sampling luck.

The model can be fitted with the function `PLNLDA` as follows:
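The fitting chunk itself is missing from this rendering; a sketch consistent with the covariate-adjusted call shown later in this vignette (the object name `myLDA_nocov` is taken from the subsequent output):

```r
myLDA_nocov <- PLNLDA(Abundance ~ 0 + offset(log(Offset)),
                      grouping = Group,
                      data = trichoptera)
```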

```
##
## Performing discriminant Analysis...
## DONE!
```

Note that `PLNLDA` uses the standard `formula` interface, like every other model in the **PLNmodels** package.

The `myLDA_nocov` variable is an `R6` object with class `PLNLDAfit`, which comes with a couple of methods. The most basic is the `show`/`print` method, which prints a brief summary of the estimation process and of the available methods:

`myLDA_nocov`

```
## Linear Discriminant Analysis for Poisson Lognormal distribution
## ==================================================================
## nb_param loglik BIC ICL
## 391 -800.028 -1560.879 -1225.04
## ==================================================================
## * Useful fields
## $model_par, $latent, $latent_pos, $var_par, $optim_par
## $loglik, $BIC, $ICL, $loglik_vec, $nb_param, $criteria
## * Useful S3 methods
## print(), coef(), sigma(), vcov(), fitted()
## predict(), predict_cond(), standard_error()
## * Additional fields for LDA
## $percent_var, $corr_map, $scores, $group_means
## * Additional S3 methods for LDA
## plot.PLNLDAfit(), predict.PLNLDAfit()
```

Comprehensive information about `PLNLDAfit` is available via `?PLNLDAfit`.

The user can easily access several fields of the `PLNLDAfit` object using `S3` methods; the most interesting ones are:

- the \(p \times p\) covariance matrix \(\hat{\boldsymbol{\Sigma}}\)

- the regression coefficient matrix \(\hat{\boldsymbol{\Theta}}\) (in this case `NULL` as there are no covariates):

`coef(myLDA_nocov)`

`## NULL`

- the \(p \times K\) matrix of group means \(\mathbf{M}\):

grouping1 | grouping2 | grouping3 | grouping4 | grouping5 | grouping6 | grouping7 | grouping8 | grouping9 | grouping10 | grouping11 | grouping12
---|---|---|---|---|---|---|---|---|---|---|---
-21.61 | -6.82 | -21.18 | -20.00 | -22.96 | -23.54 | -21.85 | -20.99 | -21.06 | -5.67 | -3.48 | -19.99
-21.58 | -22.32 | -5.68 | -19.97 | -7.45 | -23.50 | -21.81 | -20.96 | -21.03 | -21.10 | -18.94 | -4.48
-2.38 | -3.96 | -2.36 | -2.92 | -6.48 | -5.54 | -2.69 | -2.03 | -2.71 | -2.65 | -19.04 | -3.85
-21.59 | -22.34 | -5.69 | -4.52 | -22.86 | -23.57 | -21.85 | -5.52 | -5.59 | -4.91 | -19.00 | -20.02
-0.28 | -0.26 | -0.62 | -1.09 | -0.62 | -0.11 | -0.66 | -0.80 | -0.39 | -0.45 | -0.61 | -1.16
-4.00 | -3.32 | -2.93 | -4.47 | -6.72 | -8.00 | -2.85 | -3.06 | -5.53 | -3.99 | -18.91 | -19.93
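These fields can be extracted directly from the fitted object; a sketch using the accessors listed in the print output above:

```r
sigma(myLDA_nocov)        # p x p covariance matrix Sigma-hat
coef(myLDA_nocov)         # regression coefficients Theta-hat (NULL here)
myLDA_nocov$group_means   # p x K matrix of group means
```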

The `PLNLDAfit` class also benefits from two important methods: `plot` and `predict`.

`plot` method

The `plot` method provides easy-to-interpret graphics which reveal here that the groups are well separated:

`plot(myLDA_nocov)`

By default, `plot` shows the first 3 axes of the LDA when there are 4 or more groups and uses special representations for the edge cases of 3 or fewer groups.

`ggplot2`-savvy users who want to make their own representations can extract the \(n \times (K-1)\) matrix of sample scores from the `PLNLDAfit` object …
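The scores live in the `$scores` field listed in the print output; the table below can be reproduced with something like (the `knitr::kable` formatting is an assumption, mirroring the other chunks):

```r
myLDA_nocov$scores %>% head() %>% knitr::kable(digits = 2)
```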

LD1 | LD2 | LD3 | LD4 | LD5 | LD6 | LD7 | LD8 | LD9 | LD10 | LD11
---|---|---|---|---|---|---|---|---|---|---
1141218 | -345890.7 | -19565.53 | -36438.04 | -8158.59 | 3126.82 | 3234.54 | -8.42 | 114.38 | -13.37 | -82.84
1142398 | -344464.1 | -19702.37 | -36728.69 | -8066.40 | 3125.09 | 3257.54 | -38.22 | 110.92 | -12.66 | -83.62
1153829 | -331104.0 | -21401.86 | -38662.74 | -7349.46 | 3108.75 | 3098.77 | -50.44 | 118.20 | -9.40 | -78.45
1166545 | -315989.7 | -23304.10 | -40733.95 | -6580.43 | 3094.36 | 2825.04 | 27.46 | 129.42 | -7.33 | -72.07
1159038 | -324807.4 | -22173.05 | -39485.71 | -7052.29 | 3101.76 | 2943.85 | 22.90 | 124.19 | -8.94 | -75.47
1149716 | -335826.4 | -20786.65 | -37948.35 | -7618.88 | 3115.49 | 3118.14 | -10.14 | 116.83 | -11.19 | -80.07

…or the \(p \times (K-1)\) matrix of correlations between scores and (latent) variables:

 | LD1 | LD2 | LD3 | LD4 | LD5 | LD6 | LD7 | LD8 | LD9 | LD10 | LD11
---|---|---|---|---|---|---|---|---|---|---
Che | -0.35 | -0.07 | -0.37 | 0.22 | 0.56 | -0.48 | -0.20 | -0.53 | 0.83 | -0.42 | 0.06
Hyc | 0.08 | 0.62 | -0.13 | 0.32 | -0.11 | -0.05 | -0.73 | 0.81 | -0.22 | -0.12 | 0.61
Hym | 0.11 | 0.05 | 0.21 | -0.77 | -0.40 | -0.36 | 0.21 | -0.07 | -0.26 | -0.05 | -0.61
Hys | -0.41 | 0.21 | 0.68 | 0.00 | -0.33 | -0.61 | -0.02 | -0.12 | -0.29 | 0.01 | 0.28
Psy | 0.11 | -0.50 | -0.21 | -0.25 | 0.47 | -0.05 | 0.13 | -0.01 | 0.57 | -0.64 | -0.32
Aga | 0.13 | 0.04 | 0.18 | -0.90 | -0.11 | -0.28 | 0.01 | -0.26 | -0.05 | -0.26 | -0.50

`predict` method

The `predict` method has a slightly different behavior than its siblings in other models of **PLNmodels**: the goal of `predict` is to predict the discrete class based on observed *species counts* (rather than predicting counts from known covariates).

By default, `predict` uses the argument `type = "posterior"` to output the matrix of log-posterior probabilities \(\log(\hat{\pi}_k)\):

```
predicted.class <- predict(myLDA_nocov, newdata = trichoptera)
## equivalent to
## predicted.class <- predict(myLDA_nocov, newdata = trichoptera, type = "posterior")
predicted.class %>% head() %>% knitr::kable(digits = 2)
```

1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12
---|---|---|---|---|---|---|---|---|---|---|---
-1562.65 | -1889.29 | -2063.48 | -1914.94 | -1439.29 | -1000.66 | -1834.58 | -1933.57 | -1616.50 | -1772.31 | -1698.72 | -1898.10
-800.49 | -855.07 | -1282.43 | -1628.86 | -1363.65 | -991.90 | -976.35 | -911.33 | -784.64 | -898.45 | -831.41 | -1229.11
-1005.71 | -1419.27 | -1461.97 | -1913.03 | -1938.65 | -1467.80 | -1107.87 | -1414.89 | -1156.64 | -1422.16 | -1250.55 | -1448.27
-1089.86 | -1888.38 | -2071.83 | -2220.61 | -2307.42 | -2268.98 | -1391.35 | -1734.94 | -1803.64 | -1947.21 | -1719.11 | -1840.09
-947.40 | -1632.49 | -1756.83 | -2163.31 | -2207.74 | -2198.50 | -1178.51 | -1456.21 | -1492.91 | -1415.87 | -1584.18 | -1753.63
-874.16 | -789.00 | -1171.19 | -1958.41 | -1893.49 | -1454.88 | -818.67 | -777.79 | -1039.17 | -1150.18 | -1144.87 | -1627.61

You can also show them on the standard (and human-friendly) \([0, 1]\) scale with `scale = "prob"` to get the matrix of \(\hat{\pi}_k\):

```
predicted.class <- predict(myLDA_nocov, newdata = trichoptera, scale = "prob")
predicted.class %>% head() %>% knitr::kable(digits = 3)
```

1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12
---|---|---|---|---|---|---|---|---|---|---|---
0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0

Setting `type = "response"`, we can predict the most likely group \(\hat{k}\) instead:

```
predicted.class <- predict(myLDA_nocov, newdata = trichoptera, type = "response")
predicted.class
```

```
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## 6 9 1 1 1 8 9 2 9 8 1 6 2 7 8 8 10 8 8 10 9 12 4 9 4 4
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
## 12 4 4 6 6 8 7 7 9 8 2 9 9 9 9 9 9 2 2 6 10 11 12
## Levels: 1 2 3 4 5 6 7 8 9 10 11 12
```

We can check that the predictions are quite similar to the real groups (*this is not a proper validation of the method, as we used the same data set for both model fitting and prediction and are thus at risk of overfitting*).
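The confusion matrix below can be obtained with a simple cross-tabulation along these lines (a sketch; the `dnn` labels are chosen to match the printed output):

```r
table(predicted.class, trichoptera$Group,
      dnn = c("predicted", "true"))
```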

```
## true
## predicted 1 2 3 4 5 6 7 8 9 10 11 12
## 1 4 0 0 0 0 0 0 0 0 0 0 0
## 2 1 1 0 0 0 0 0 1 0 2 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 3 2 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0 0
## 6 2 0 0 0 1 1 0 0 0 1 0 0
## 7 0 1 0 0 0 0 2 0 0 0 0 0
## 8 2 2 2 0 0 0 1 1 0 0 0 0
## 9 3 0 1 1 0 0 0 2 5 0 0 0
## 10 0 1 1 0 0 0 0 0 0 1 0 0
## 11 0 0 0 0 0 0 0 0 0 0 1 0
## 12 0 0 1 0 1 0 0 0 0 0 0 1
```

Finally, we can get the coordinates of the new data on the same graph as the original ones with `type = "scores"`. This is done by averaging the latent positions \(\hat{\mathbf{Z}}_i + \boldsymbol{\mu}_k\) (found when the sample is assumed to come from group \(k\)) and weighting them with the \(\hat{\pi}_k\). Some samples have compositions that put them very far from their group mean.

```
library(ggplot2)
predicted.scores <- predict(myLDA_nocov, newdata = trichoptera, type = "scores")
colnames(predicted.scores) <- paste0("Axis.", 1:ncol(predicted.scores))
predicted.scores <- as.data.frame(predicted.scores)
predicted.scores$group <- trichoptera$Group
plot(myLDA_nocov, map = "individual", nb_axes = 2, plot = FALSE) +
  geom_point(data = predicted.scores,
             aes(x = Axis.1, y = Axis.2, color = group, label = NULL))
```

It is possible to correct for other covariates before finding the LDA axes that best separate the groups. In our case, we're going to use `Wind` as a covariate and illustrate the main differences with the previous fit:

```
myLDA_cov <- PLNLDA(Abundance ~ Wind + 0 + offset(log(Offset)),
                    grouping = Group,
                    data = trichoptera)
```

```
##
## Performing discriminant Analysis...
## DONE!
```

All fields of our new `PLNLDA` fit can be accessed as before, with similar results. The only important difference is the result of `coef`: since we included a covariate in the model, `coef` now returns a 1-column matrix for \(\hat{\boldsymbol{\Theta}}\) instead of `NULL`:
 | Wind
---|---
Che | -0.3321637
Hyc | 1.2304346
Hym | -0.1930088
Hys | -0.4546537
Psy | 0.0384414
Aga | -0.0612139

The group-specific main effects can still be accessed with `$group_means`:

grouping1 | grouping2 | grouping3 | grouping4 | grouping5 | grouping6 | grouping7 | grouping8 | grouping9 | grouping10 | grouping11 | grouping12
---|---|---|---|---|---|---|---|---|---|---|---
-15.27 | -9.71 | -18.20 | -14.14 | -19.35 | -33.16 | -22.54 | -21.19 | -14.65 | 1.86 | 7.70 | -17.68
-18.08 | -33.08 | -2.47 | -11.94 | -10.57 | -36.95 | -25.22 | -22.50 | -11.06 | -13.04 | -3.98 | 0.51
-1.22 | -4.40 | -1.79 | -1.84 | -5.80 | -6.80 | -2.66 | -1.84 | -1.62 | -1.61 | -18.46 | -3.33
-18.37 | -25.03 | -3.47 | -0.76 | -20.44 | -29.31 | -22.72 | -5.24 | -1.55 | -0.95 | -13.90 | -19.20
-0.05 | -0.42 | -0.47 | -0.85 | -0.47 | -0.43 | -0.60 | -0.71 | -0.06 | -0.15 | -0.18 | -1.03
-1.93 | -4.05 | -1.79 | -2.52 | -5.54 | -10.38 | -2.72 | -2.76 | -3.43 | -1.88 | -16.91 | -20.10

`plot` method

Once again, the `plot` method is very useful to get a quick look at the results.

`plot(myLDA_cov)`

`predict` method

We can again predict the most likely group for each sample:

`predicted.class_cov <- predict(myLDA_cov, newdata = trichoptera, type = "response")`

and check that we recover the correct class in most cases (again, we used the same data set for model fitting and group prediction only for ease of exposition):
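As before, the confusion matrix below can be obtained with a cross-tabulation along the lines of:

```r
table(predicted.class_cov, trichoptera$Group,
      dnn = c("predicted", "true"))
```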

```
## true
## predicted 1 2 3 4 5 6 7 8 9 10 11 12
## 1 4 0 0 0 0 0 1 0 0 0 0 0
## 2 1 1 0 1 0 0 0 1 0 1 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 2 2 0 0 0 0 0 0 0
## 5 0 0 0 1 1 0 0 0 0 0 0 0
## 6 1 0 0 0 0 1 0 0 0 0 0 0
## 7 1 1 0 0 0 0 0 0 0 1 0 0
## 8 1 2 2 0 0 0 2 2 0 0 0 0
## 9 2 1 1 0 0 0 0 1 5 1 0 0
## 10 0 0 1 0 0 0 0 0 0 1 0 0
## 11 2 0 0 0 0 0 0 0 0 0 1 0
## 12 0 0 1 0 1 0 0 0 0 0 0 1
```

Aitchison, J., and C. H. Ho. 1989. “The Multivariate Poisson-Log
Normal Distribution.” *Biometrika* 76 (4): 643–53.

Fisher, R. A. 1936. “The Use of Multiple Measurements in Taxonomic
Problems.” *Annals of Eugenics* 7 (2): 179–88.

Johnson, Steven G. 2011. *The NLopt Nonlinear-Optimization
Package*. https://nlopt.readthedocs.io/en/latest/.

Rao, C. Radhakrishna. 1948. “The Utilization of Multiple
Measurements in Problems of Biological Classification.”
*Journal of the Royal Statistical Society. Series B
(Methodological)* 10 (2): 159–203.

Svanberg, Krister. 2002. “A Class of Globally Convergent
Optimization Methods Based on Conservative Convex Separable
Approximations.” *SIAM Journal on Optimization* 12 (2):
555–73.