`vignettes/PLNPCA.Rmd`

`PLNPCA.Rmd`

This vignette illustrates the standard use of the `PLNPCA`

function and the methods accompanying the R6 Classes
`PLNPCAfamily`

and `PLNPCAfit`

.

The packages required for the analysis are **PLNmodels**
plus some others for data manipulation and representation:

The main function `PLNPCA`

integrates some features of the
**future** package to perform parallel computing: you can
set your plan now to speed the fit by relying on 2 workers as
follows:

We illustrate our point with the trichoptera data set, a full description of which can be found in the corresponding vignette. Data preparation is also detailed in the specific vignette.

```
data(trichoptera)
trichoptera <- prepare_data(trichoptera$Abundance, trichoptera$Covariate)
```

The `trichoptera`

data frame stores a matrix of counts
(`trichoptera$Abundance`

), a matrix of offsets
(`trichoptera$Offset`

) and some vectors of covariates
(`trichoptera$Wind`

, `trichoptera$Temperature`

,
etc.)

In the vein of Tipping and Bishop (1999), we introduce in Chiquet, Mariadassou, and Robin (2018) a probabilistic PCA model for multivariate count data which is a variant of the Poisson Lognormal model of Aitchison and Ho (1989) (see the PLN vignette as a reminder). Indeed, it can be viewed as a PLN model with an additional rank constraint on the covariance matrix $\boldsymbol\Sigma$ such that $\mathrm{rank}(\boldsymbol\Sigma)= q$.

This PLN-PCA model can be written in a hierarchical framework where a sample of $p$-dimensional observation vectors $\mathbf{Y}_i$ is related to some $q$-dimensional vectors of latent variables $\mathbf{W}_i$ as follows: $\begin{equation} \begin{array}{rcl} \text{latent space } & \mathbf{W}_i \quad \text{i.i.d.} & \mathbf{W}_i \sim \mathcal{N}(\mathbf{0}_q, \mathbf{I}_q) \\ \text{parameter space } & \mathbf{Z}_i = {\boldsymbol\mu} + \mathbf{C}^\top \mathbf{W}_i & \\ \text{observation space } & Y_{ij} | Z_{ij} \quad \text{indep.} & Y_{ij} | Z_{ij} \sim \mathcal{P}\left(\exp\{Z_{ij}\}\right) \end{array} \end{equation}$

The parameter ${\boldsymbol\mu}\in\mathbb{R}^p$ corresponds to the main effects, the $p\times q$ matrix $\mathbf{C}$ to the loadings in the parameter spaces and $\mathbf{W}_i$ to the scores of the $i$-th observation in the low-dimensional latent subspace of the parameter space. The dimension of the latent space $q$ corresponds to the number of axes in the PCA or, in other words, to the rank of $\mathbf{C}\mathbf{C}^\intercal$. An hopefully more intuitive way of writing this model is the following: $\begin{equation} \begin{array}{rcl} \text{latent space } & \mathbf{Z}_i \sim \mathcal{N}({\boldsymbol\mu},\boldsymbol\Sigma), \qquad \boldsymbol\Sigma = \mathbf{C}\mathbf{C}^\top \\ \text{observation space } & Y_{ij} | Z_{ij} \quad \text{indep.} & Y_{ij} | Z_{ij} \sim \mathcal{P}\left(\exp\{Z_{ij}\}\right), \end{array} \end{equation}$ where the interpretation of PLN-PCA as a rank-constrained PLN model is more obvious.

Just like PLN, PLN-PCA generalizes to a formulation close to a multivariate generalized linear model where the main effect is due to a linear combination of $d$ covariates $\mathbf{x}_i$ and to a vector $\mathbf{o}_i$ of $p$ offsets in sample $i$. The latent layer then reads $\begin{equation} \mathbf{Z}_i \sim \mathcal{N}({\mathbf{o}_i + \mathbf{x}_i^\top\mathbf{B}},\boldsymbol\Sigma), \qquad \boldsymbol\Sigma = \mathbf{C}\mathbf{C}^\top, \end{equation}$ where $\mathbf{B}$ is a $d\times p$ matrix of regression parameters.

Dimension reduction and visualization is the main objective in (PLN)-PCA. To reach this goal, we need to first estimate the model parameters. Inference in PLN-PCA focuses on the regression parameters $\mathbf{B}$ and on the covariance matrix $\boldsymbol\Sigma$. Technically speaking, we adopt a variational strategy to approximate the log-likelihood function and optimize the consecutive variational surrogate of the log-likelihood with a gradient-ascent-based approach. To this end, we rely on the CCSA algorithm of Svanberg (2002) implemented in the C++ library (Johnson 2011), which we link to the package. Technical details can be found in Chiquet, Mariadassou, and Robin (2018).

In the package, the PLNPCA model is adjusted with the function
`PLNPCA`

, which we review in this section. This function
adjusts the model for a series of value of
$q$
and provides a collection of objects `PLNPCAfit`

stored in an
object with class `PLNPCAfamily`

.

The class `PLNPCAfit`

inherits from the class
`PLNfit`

, so we strongly recommend the reader to be
comfortable with `PLN`

and `PLNfit`

before using
`PLNPCA`

(see the PLN vignette).

We fit a collection of $q$ models as follows:

```
##
## Initialization...
##
## Adjusting 4 PLN models for PCA analysis.
## Rank approximation = 4 Rank approximation = 3 Rank approximation = 1 Rank approximation = 2
## Post-treatments
## DONE!
```

Note the use of the `formula`

object to specify the model,
similar to the one used in the function `PLN`

.

`PLNPCAfamily`

The `PCA_models`

variable is an `R6`

object
with class `PLNPCAfamily`

, which comes with a couple of
methods. The most basic is the `show/print`

method, which
sends a brief summary of the estimation process:

`PCA_models`

```
## --------------------------------------------------------
## COLLECTION OF 4 POISSON LOGNORMAL MODELS
## --------------------------------------------------------
## Task: Principal Component Analysis
## ========================================================
## - Ranks considered: from 1 to 4
## - Best model (greater BIC): rank = 4
## - Best model (greater ICL): rank = 3
```

One can also easily access the successive values of the criteria in the collection

param | nb_param | loglik | BIC | ICL |
---|---|---|---|---|

1 | 34 | -1042.0745 | -1108.2355 | -1120.0924 |

2 | 50 | -730.5558 | -827.8513 | -860.3039 |

3 | 65 | -641.3825 | -767.8667 | -824.3415 |

4 | 79 | -596.9058 | -750.6327 | -838.8095 |

A quick diagnostic of the optimization process is available via the
`convergence`

field:

param | nb_param | status | backend | objective | iterations | |
---|---|---|---|---|---|---|

out | 1 | 34 | 3 | nlopt | -36014.24 | 589 |

elt | 2 | 50 | 3 | nlopt | -36325.75 | 1523 |

elt.1 | 3 | 65 | 3 | nlopt | -36414.93 | 1756 |

elt.2 | 4 | 79 | 3 | nlopt | -36459.4 | 3994 |

Comprehensive information about `PLNPCAfamily`

is
available via `?PLNPCAfamily`

.

The `plot`

method of `PLNPCAfamily`

displays
evolution of the criteria mentioned above, and is a good starting point
for model selection:

`plot(PCA_models)`

Note that we use the original definition of the BIC/ICL criterion
($\texttt{loglik} - \frac{1}{2}\texttt{pen}$),
which is on the same scale as the log-likelihood. A popular
alternative consists in using
$-2\texttt{loglik} + \texttt{pen}$
instead. You can do so by specifying `reverse = TRUE`

:

`plot(PCA_models, reverse = TRUE)`

In this case, the variational lower bound of the log-likelihood is
hopefully strictly increasing (or rather decreasing if using
`reverse = TRUE`

) with the number of axes (or subspace
dimension). Also note the (approximated)
$R^2$
which is displayed for each value of
$q$
(see (Chiquet, Mariadassou,
and Robin 2018) for details on its computation).

From this plot, we can see that the best model in terms of BIC or ICL
is obtained for a rank
$q=4$
or
$q=3$.
We may extract the corresponding model with the method
`getBestModel("ICL")`

. A model with a specific rank can be
extracted with the `getModel()`

method:

```
myPCA_ICL <- getBestModel(PCA_models, "ICL")
myPCA_BIC <- getModel(PCA_models, 3) # getBestModel(PCA_models, "BIC") is equivalent here
```

`PLNPCAfit`

Objects `myPCA_ICL`

and `myPCA_BIC`

are
`R6Class`

objects of class `PLNPCAfit`

which in
turns own a couple of methods, some inherited from `PLNfit`

and some others specific, mostly for visualization purposes. The
`plot`

method provides individual maps and correlation
circles as in usual PCA. If an additional classification exists for the
observations β which is the case here with the available classification
of the trapping nights β , it can be passed as an argument to the
function.^{1}

`plot(myPCA_ICL, ind_cols = trichoptera$Group)`

Among other fields and methods (see `?PLNPCAfit`

for a
comprehensive view), the most interesting for the end-user in the
context of PCA are

- the regression coefficient matrix

Che | Hyc | Hym | Hys | Psy | Aga | Glo | Ath | Cea | Ced | Set | All | Han | Hfo | Hsp | Hve | Sta | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

(Intercept) | -7.502918 | -8.095747 | -3.023816 | -6.862104 | -0.5339895 | -3.820183 | -6.420115 | -5.826587 | -7.333493 | -3.520042 | -4.115511 | -5.048393 | -4.313785 | -5.937578 | -3.971213 | -7.423077 | -2.582003 |

- the estimated covariance matrix $\boldsymbol\Sigma$ with fixed rank

- the rotation matrix (in the latent space)

PC1 | PC2 | PC3 | |
---|---|---|---|

Che | -0.2044582 | 0.3293331 | 0.0555773 |

Hyc | -0.4335069 | -0.2689133 | 0.1455259 |

Hym | -0.1219451 | 0.0850126 | 0.3327006 |

Hys | -0.4125562 | 0.2713408 | 0.3861607 |

Psy | 0.0502550 | 0.0344179 | -0.0631976 |

Aga | 0.0742223 | 0.2190798 | 0.2731309 |

- the principal components values (or scores)

PC1 | PC2 | PC3 |
---|---|---|

-1.679062 | -0.6613174 | 0.6491076 |

3.706858 | 0.2595393 | 2.2288367 |

7.376533 | -1.2673287 | 0.0801102 |

6.220114 | -0.8605575 | -2.1642434 |

4.590906 | 0.2680092 | -1.1274612 |

4.084722 | 0.6253345 | 0.5626871 |

`PLNPCAfit`

also inherits from the methods of
`PLNfit`

(see the appropriate
vignette). Most are recalled via the show method:

`myPCA_ICL`

```
## Poisson Lognormal with rank constrained for PCA (rank = 3)
## ==================================================================
## nb_param loglik BIC ICL
## 65 -641.383 -767.867 -824.341
## ==================================================================
## * Useful fields
## $model_par, $latent, $latent_pos, $var_par, $optim_par
## $loglik, $BIC, $ICL, $loglik_vec, $nb_param, $criteria
## * Useful S3 methods
## print(), coef(), sigma(), vcov(), fitted()
## predict(), predict_cond(), standard_error()
## * Additional fields for PCA
## $percent_var, $corr_circle, $scores, $rotation, $eig, $var, $ind
## * Additional S3 methods for PCA
## plot.PLNPCAfit()
```

We provide simple plotting functions but a wealth of plotting
utilities are available for factorial analyses results. The following
bindings allow you to use widely popular tools to make your own plots:
`$eig`

, `$var`

and `$ind`

.

```
## All summaries associated to the individuals
str(myPCA_ICL$ind)
```

```
## List of 4
## $ coord : num [1:49, 1:3] -1.68 3.71 7.38 6.22 4.59 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:49] "1" "2" "3" "4" ...
## .. ..$ : chr [1:3] "Dim.1" "Dim.2" "Dim.3"
## $ cos2 : num [1:49, 1:3] 0.767 0.732 0.971 0.877 0.94 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:49] "1" "2" "3" "4" ...
## .. ..$ : chr [1:3] "Dim.1" "Dim.2" "Dim.3"
## $ contrib: num [1:49, 1:3] 0.564 2.75 10.891 7.744 4.218 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:49] "1" "2" "3" "4" ...
## .. ..$ : chr [1:3] "Dim.1" "Dim.2" "Dim.3"
## $ dist : Named num [1:49] 1.92 4.33 7.49 6.64 4.73 ...
## ..- attr(*, "names")= chr [1:49] "1" "2" "3" "4" ...
```

```
## Coordinates of the individuals in the principal plane
head(myPCA_ICL$ind$coord)
```

```
## Dim.1 Dim.2 Dim.3
## 1 -1.679062 -0.6613174 0.64910757
## 2 3.706858 0.2595393 2.22883672
## 3 7.376533 -1.2673287 0.08011023
## 4 6.220114 -0.8605575 -2.16424343
## 5 4.590906 0.2680092 -1.12746117
## 6 4.084722 0.6253345 0.56268706
```

You can also use high level functions from the factoextra package to extract relevant informations

```
## Eigenvalues
factoextra::get_eig(myPCA_ICL)
```

```
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 499.6388 39.08384 39.08384
## Dim.2 293.1372 22.93042 62.01426
## Dim.3 234.5906 18.35066 80.36492
```

```
## Variables
factoextra::get_pca_var(myPCA_ICL)
```

```
## Principal Component Analysis Results for variables
## ===================================================
## Name Description
## 1 "$coord" "Coordinates for the variables"
## 2 "$cor" "Correlations between variables and dimensions"
## 3 "$cos2" "Cos2 for the variables"
## 4 "$contrib" "contributions of the variables"
```

```
## Individuals
factoextra::get_pca_ind(myPCA_ICL)
```

```
## Principal Component Analysis Results for individuals
## ===================================================
## Name Description
## 1 "$coord" "Coordinates for the individuals"
## 2 "$cos2" "Cos2 for the individuals"
## 3 "$contrib" "contributions of the individuals"
```

And some of the very nice plotting methods such as biplots, correlation circles and scatter plots of the scores.

`factoextra::fviz_pca_biplot(myPCA_ICL)`

`factoextra::fviz_pca_var(myPCA_ICL)`

`factoextra::fviz_pca_ind(myPCA_ICL)`

You can project new data in the PCA space although itβs slightly
involved at the moment. We demonstrate that by projecting the original
data on top of the original graph. As expected, the projections of the
*new* data points (small red points) are superimposed to the
original data points (large black points).

```
## Project newdata into PCA space
new_scores <- myPCA_ICL$project(newdata = trichoptera)
## Overprint
p <- factoextra::fviz_pca_ind(myPCA_ICL, geom = "point", col.ind = "black")
factoextra::fviz_add(p, new_scores, geom = "point", color = "red",
addlabel = FALSE, pointsize = 0.5)
```

A contribution of PLN-PCA is to let the possibility to taking into account some covariates in the parameter space. Such a strategy often completely changes the interpretation of PCA. Indeed, the covariates are often responsible for some strong structure in the data. The effect of the covariates should be removed since they are often quite obvious for the analyst and may hide some more important and subtle effects.

In the case at hand, the covariates corresponds to the meteorological variables. Let us try to introduce some of them in our model, for instance, the temperature, the wind and the cloudiness. This can be done thanks to the model formula:

```
PCA_models_cov <-
PLNPCA(
Abundance ~ 1 + offset(log(Offset)) + Temperature + Wind + Cloudiness,
data = trichoptera,
ranks = 1:4
)
```

```
##
## Initialization...
##
## Adjusting 4 PLN models for PCA analysis.
## Rank approximation = 1 Rank approximation = 2 Rank approximation = 4 Rank approximation = 3
## Post-treatments
## DONE!
```

Again, the best model is obtained for $q=3$ classes.

`plot(PCA_models_cov)`

`myPCA_cov <- getBestModel(PCA_models_cov, "ICL")`

Suppose that we want to have a closer look to the first two axes. This can be done thanks to the plot method:

```
gridExtra::grid.arrange(
plot(myPCA_cov, map = "individual", ind_cols = trichoptera$Group, plot = FALSE),
plot(myPCA_cov, map = "variable", plot = FALSE),
ncol = 2
)
```

We can check that the fitted value of the counts β even with this low-rank covariance matrix β are close to the observed ones:

```
data.frame(
fitted = as.vector(fitted(myPCA_cov)),
observed = as.vector(trichoptera$Abundance)
) %>%
ggplot(aes(x = observed, y = fitted)) +
geom_point(size = .5, alpha =.25 ) +
scale_x_log10(limits = c(1,1000)) +
scale_y_log10(limits = c(1,1000)) +
theme_bw() + annotation_logticks()
```

When you are done, do not forget to get back to the standard
sequential plan with *future*.

`future::plan("sequential")`

Aitchison, J., and C. H. Ho. 1989. βThe Multivariate Poisson-Log
Normal Distribution.β *Biometrika* 76 (4): 643β53.

Chiquet, Julien, Mahendra Mariadassou, and StΓ©phane Robin. 2018.
βVariational Inference for Probabilistic Poisson PCA.β
*The Annals of Applied Statistics* 12: 2674β98.

Johnson, Steven G. 2011. *The NLopt Nonlinear-Optimization
Package*. https://nlopt.readthedocs.io/en/latest/.

Svanberg, Krister. 2002. βA Class of Globally Convergent
Optimization Methods Based on Conservative Convex Separable
Approximations.β *SIAM Journal on Optimization* 12 (2):
555β73.

Tipping, M. E, and C. M Bishop. 1999. βProbabilistic Principal
Component Analysis.β *Journal of the Royal Statistical
Society: Series B (Statistical Methodology)* 61 (3): 611β22.

With our PLN-PCA (and any pPCA model for count data, where successive models are not nested), it is important to performed the model selection of $q$ prior to visualization, since the model with rank $q=3$ is not nested in the model with rank $q=4$. Hence, percentage of variance must be interpreted with care: it sums to 100% but must be put in perspective with the model $R^2$, giving an approximation of the total percentage of variance explained with the current model.β©οΈ