Skip to contents

All the models in PLNmodels are fitted by variational inference: each one maximises a tractable lower bound of an otherwise intractable log-likelihood. This document describes, model by model, which optimizer is used and, for the home-made "builtin" backend, the mathematical principles behind it.

1 The common variational framework

Every model shares the same Poisson log-normal core. For a count matrix π˜βˆˆβ„•nΓ—p\mathbf Y \in \mathbb N^{n\times p}, offsets 𝐎\mathbf O and covariates 𝐗\mathbf X, the latent Gaussian vectors 𝐙iβˆΌπ’©(𝐱i⊀𝐁,𝚺)\mathbf Z_i \sim \mathcal N(\mathbf x_i^\top\mathbf B,\, \boldsymbol\Sigma) drive Poisson emissions Yij|ZijβˆΌπ’«(exp⁑(Oij+Zij))Y_{ij}\,|\,Z_{ij}\sim\mathcal P(\exp(O_{ij}+Z_{ij})).

Because the marginal likelihood integrates out 𝐙\mathbf Z, inference maximises a variational lower bound (the ELBO) obtained with a factorised Gaussian approximation q(𝐙i)=𝒩(𝐱i⊀𝐁+𝐦i,diag(𝐬i2))q(\mathbf Z_i)=\mathcal N(\mathbf x_i^\top\mathbf B+\mathbf m_i,\ \mathrm{diag}(\mathbf s_i^2)). Writing 𝛀=πšΊβˆ’1\boldsymbol\Omega=\boldsymbol\Sigma^{-1}, 𝐀=exp⁑(𝐎+𝐗𝐁+𝐌+12𝐒2)\mathbf A=\exp(\mathbf O+\mathbf X\mathbf B+\mathbf M+\tfrac12\mathbf S^2) and 𝐙=𝐎+𝐗𝐁+𝐌\mathbf Z=\mathbf O+\mathbf X\mathbf B+\mathbf M, the bound reads

J(𝐁,𝛀,𝐌,𝐒2)=βˆ‘ij(YijZijβˆ’Aij+12log⁑Sij2)βˆ’12βˆ‘i(𝐦iβŠ€π›€π¦i+tr(diag(𝐬i2)𝛀))+n2log⁑det⁑𝛀+cst. J(\mathbf B,\boldsymbol\Omega,\mathbf M,\mathbf S^2)= \sum_{ij}\Big(Y_{ij}Z_{ij}-A_{ij}+\tfrac12\log S^2_{ij}\Big) -\tfrac12\sum_i\Big(\mathbf m_i^\top\boldsymbol\Omega\,\mathbf m_i+\mathrm{tr}(\mathrm{diag}(\mathbf s_i^2)\boldsymbol\Omega)\Big) +\tfrac{n}{2}\log\det\boldsymbol\Omega + \text{cst}.

The variational parameters are the means 𝐌\mathbf M and variances 𝐒2\mathbf S^2 (one pair per observation and per latent dimension); the model parameters are the regression coefficients 𝐁\mathbf B and the covariance structure 𝚺\boldsymbol\Sigma (or 𝛀\boldsymbol\Omega). Optimization alternates a VE-step (update 𝐌,𝐒2\mathbf M,\mathbf S^2 at fixed model parameters) with an M-step (update 𝐁,𝚺\mathbf B,\boldsymbol\Sigma), until the ELBO stabilises. The different models change what structure is imposed on 𝚺\boldsymbol\Sigma (or on the emission), not this general scheme. Throughout, the variances are optimised in the unconstrained parametrisation 𝛙=log⁑𝐒2\boldsymbol\psi=\log\mathbf S^2.

2 Which optimizer for which model

Three backends are available, selected through the backend argument of each *_param() helper:

  • nlopt β€” a first-order solver from the NLOPT library (the CCSAQ conservative convex separable approximation). Robust and cheap per iteration but needs many iterations.
  • builtin β€” the home-made second-order optimizer implemented directly in the package (the subject of Β§3). It uses the analytic Hessian of the bound.
  • torch β€” automatic differentiation with first-order optimizers (Adam, RPROP, …). Experimental; typically reaches a lower bound than the other two.
Model Imposed structure Backends Default builtin optimizer
PLN 𝚺\boldsymbol\Sigma full / diagonal / spherical / fixed / genetic nlopt, builtin, torch nlopt joint coordinate Newton (§3.1)
PLNLDA PLN with per-group means (via PLN) nlopt as PLN
PLNPCA rank-qq𝚺=π‚π‚βŠ€\boldsymbol\Sigma=\mathbf C\mathbf C^\top nlopt, builtin, torch nlopt profiled trust-region Newton (Β§3.4)
PLNnetwork sparse 𝛀\boldsymbol\Omega (β„“1\ell_1) builtin, nlopt builtin Newton VE + graphical-lasso M-step (Β§3.2)
PLNmixture mixture of KK PLN builtin, nlopt, torch builtin outer mixture-EM + per-component builtin (Β§3.5)
ZIPLN zero-inflation + 𝚺\boldsymbol\Sigma structure builtin, nlopt builtin joint Newton on (𝐌,𝛙,𝐑)(\mathbf M,\boldsymbol\psi,\mathbf R) (Β§3.3)
ZIPLNnetwork zero-inflation + sparse 𝛀\boldsymbol\Omega builtin, nlopt builtin as ZIPLN + graphical-lasso

The default is nlopt for PLN/PLNPCA (conservative, well tested) and builtin for the penalised / structured models, where the second-order steps pay off most.

3 The builtin backend: mathematical principles

The common thread is that the VE-step is concave and separable, which makes it an ideal target for Newton’s method with a cheaply invertible Hessian; the M-steps are then either closed-form or a small convex problem.

3.1 PLN and its covariance variants β€” joint coordinate Newton

At fixed (𝐁,𝛀)(\mathbf B,\boldsymbol\Omega) the bound decouples across observations, and for each observation the pair (𝐦i,𝛙i)(\mathbf m_i,\boldsymbol\psi_i) is updated by a Newton step that is joint in (m,ψ)(m,\psi) but diagonal across the pp coordinates. Per entry (i,j)(i,j) the gradient is

βˆ‚mJ=(πŒπ›€)ij+Aijβˆ’Yij,βˆ‚ΟˆJ=12(AijSij2+Ο‰jjSij2βˆ’1), \partial_{m}J = (\mathbf M\boldsymbol\Omega)_{ij}+A_{ij}-Y_{ij}, \qquad \partial_{\psi}J = \tfrac12\big(A_{ij}S^2_{ij}+\omega_{jj}S^2_{ij}-1\big),

and the 2Γ—22\times2 Hessian block, using Ο‰jj=Ξ©jj\omega_{jj}=\Omega_{jj}, is

H=(Aij+Ο‰jj12AijSij212AijSij212Sij2(Aij(1+12Sij2)+Ο‰jj)), H=\begin{pmatrix} A_{ij}+\omega_{jj} & \tfrac12 A_{ij}S^2_{ij}\\[2pt] \tfrac12 A_{ij}S^2_{ij} & \tfrac12 S^2_{ij}\big(A_{ij}(1+\tfrac12 S^2_{ij})+\omega_{jj}\big)\end{pmatrix},

so the joint step is obtained by inverting a 2Γ—22\times2 matrix in closed form (compute_joint_step_MS in src/covariance_pln.h). Solving (m,ψ)(m,\psi)together captures the coupling between the mean and its variance and converges in a handful of passes. The cross term 12AijSij2\tfrac12 A_{ij}S^2_{ij} is exactly this coupling; a separate fixed point for ψ\psi would ignore it.

The M-step is analytic. The regression coefficients are profiled out, 𝐁̂=(π—βŠ€π–π—)βˆ’1π—βŠ€π–πŒfull\widehat{\mathbf B}=(\mathbf X^\top\mathbf W\mathbf X)^{-1}\mathbf X^\top\mathbf W\,\mathbf M_{\text{full}}, and the covariance is the empirical variational covariance, πšΊΜ‚=1n(𝐌⊀𝐌+diag(βˆ‘i𝐬i2))\widehat{\boldsymbol\Sigma}=\tfrac1n\big(\mathbf M^\top\mathbf M+\mathrm{diag}(\textstyle\sum_i\mathbf s_i^2)\big), projected onto the requested structure (full, diagonal, spherical, fixed, or the genetic Οƒ2(ρ𝐂+(1βˆ’Ο)𝐈)\sigma^2(\rho\mathbf C+(1-\rho)\mathbf I) solved by 1-D bisection on ρ\rho). All variants share one templated implementation (CovTraitsBase), each supplying only its covariance-specific primitives.

3.2 PLNnetwork β€” Newton VE-step + graphical-lasso M-step

PLNnetwork adds an β„“1\ell_1 penalty on the precision 𝛀\boldsymbol\Omega to recover a sparse conditional-dependence graph. The VE-step is unchanged (Β§3.1 Newton), and the covariance M-step becomes a graphical lasso:

𝛀̂=arg⁑max𝛀≻0n2log⁑detβ‘π›€βˆ’n2tr(πšΊΜ‚π›€)βˆ’Ξ»β€–π›€β€–1, \widehat{\boldsymbol\Omega}=\arg\max_{\boldsymbol\Omega\succ0}\ \tfrac{n}{2}\log\det\boldsymbol\Omega-\tfrac{n}{2}\mathrm{tr}(\widehat{\boldsymbol\Sigma}\boldsymbol\Omega)-\lambda\lVert\boldsymbol\Omega\rVert_{1},

solved efficiently with glassoFast along the penalty path Ξ»\lambda. Successive penalties are warm-started, and the inception model is only partially converged (maxit_ve = 1, inception_niter = 5) so that the latent means do not over-fit the unpenalised optimum before the sparse grid begins.

3.3 ZIPLN β€” joint Newton on (𝐌,𝛙,𝐑)(\mathbf M,\boldsymbol\psi,\mathbf R)

Zero-inflated PLN augments the model with Bernoulli β€œexcess-zero” indicators; the variational approximation adds posterior probabilities 𝐑\mathbf R of being an excess zero. The builtin VE-step is again a Newton step, now joint in (𝐦i,𝛙i,𝐫i)(\mathbf m_i,\boldsymbol\psi_i,\mathbf r_i), with 𝐑\mathbf R updated inside the C++ solver (a closed-form logistic update given 𝐌,𝐒2\mathbf M,\mathbf S^2). The effective Poisson rate becomes (1βˆ’π‘)βŠ™π€(1-\mathbf R)\odot\mathbf A, which reuses exactly the PLN covariance machinery with the substitution 𝐀←(1βˆ’π‘)βŠ™π€\mathbf A\!\leftarrow\!(1-\mathbf R)\odot\mathbf A. The M-step updates the zero-inflation coefficients and the covariance (full/diagonal/spherical/fixed, or sparse via graphical lasso for ZIPLNnetwork).

3.4 PLNPCA β€” profiled trust-region Newton

PLNPCA constrains 𝚺=π‚π‚βŠ€\boldsymbol\Sigma=\mathbf C\mathbf C^\top to rank qq, so the latent is 𝐙=𝐎+𝐗𝐁+πŒπ‚βŠ€\mathbf Z=\mathbf O+\mathbf X\mathbf B+\mathbf M\mathbf C^\top with qq-dimensional scores πŒβˆˆβ„nΓ—q\mathbf M\in\mathbb R^{n\times q} and loadings π‚βˆˆβ„pΓ—q\mathbf C\in\mathbb R^{p\times q}. NaΓ―ve block-coordinate ascent stalls on the saddle points of this non-convex landscape; the builtin optimizer avoids them with a profiled, saddle-aware trust-region Newton.

Profiling. The score block (𝐌,𝛙)(\mathbf M,\boldsymbol\psi) is concave and is solved by a per-observation Newton VE-step (a qΓ—qq\times q solve for each 𝐦i\mathbf m_i, plus a fixed point for 𝐬i2\mathbf s_i^2). This defines the profiled objective on the loadings alone, g(𝐁,𝐂)=max𝐌,𝐒2J(𝐁,𝐂,𝐌,𝐒2). g(\mathbf B,\mathbf C)=\max_{\mathbf M,\mathbf S^2}J(\mathbf B,\mathbf C,\mathbf M,\mathbf S^2). By the envelope theorem its gradient is free β€” it is simply the partial gradient βˆ‚J/βˆ‚(𝐁,𝐂)\partial J/\partial(\mathbf B,\mathbf C) evaluated at the optimal (𝐌,𝐒2)(\mathbf M,\mathbf S^2).

Reduced (Schur) Hessian. Writing ΞΈ=(𝐁,𝐂)\theta=(\mathbf B,\mathbf C) (the loadings) and Ο†=(𝐌,𝛙)\varphi=(\mathbf M,\boldsymbol\psi) (the profiled scores), the Hessian of gg is the Schur complement Hred=JΞΈΞΈβˆ’JΞΈΟ†JΟ†Ο†βˆ’1Jφθ. H_{\text{red}}=J_{\theta\theta}-J_{\theta\varphi}\,J_{\varphi\varphi}^{-1}\,J_{\varphi\theta}. It is never formed. The inner block JφφJ_{\varphi\varphi} is block-diagonal per observation (2qΓ—2q2q\times2q, inverted analytically); the cross/outer terms are applied through analytic directional derivatives of the gradient, so a Hessian–vector product Hred𝐯H_{\text{red}}\mathbf v costs only a couple of gradient evaluations plus nn small solves.

Saddle-aware trust region. The reduced Hessian is indefinite (the landscape is saddle-rich), so gg is maximised with a trust-region method whose subproblem is solved by a preconditioned Steihaug conjugate gradient. The preconditioner is the analytic diagonal of JΞΈΞΈJ_{\theta\theta} (a Jacobi metric that absorbs the very different scales of 𝐁\mathbf B and 𝐂\mathbf C), and the conjugate gradient follows negative-curvature directions to the trust-region boundary β€” precisely the directions that let the iterate escape saddles rather than stall at them. This reaches a variational bound at least as high as nlopt across datasets, in one to two orders of magnitude fewer outer iterations.

Large data

On large problems the reduced Hessian is strongly indefinite and gg is highly non-quadratic for large steps, so the trust region β€” not the linear solver β€” bounds progress: the outer iteration keeps improving the bound but converges slowly and is capped by maxit_out (default 150) rather than by the gradient tolerance. For the best bound on large data, raise it, e.g. PLNPCA_param(backend = "builtin", config_optim = list(maxit_out = 300)).

3.5 PLNmixture β€” mixture-EM with builtin components

A PLN mixture with KK components is fitted by a genuine EM loop at the R level: an E-step updates the posterior class probabilities 𝛕\boldsymbol\tau (a soft-max of the component variational bounds plus the log mixing proportions), and an M-step re-fits each component as a weighted PLN β€” using the builtin (or nlopt) optimizer of Β§3.1 with observation weights Ο„ik\tau_{ik} β€” and updates the shared covariate effects. The mixing proportions are the column means of 𝛕\boldsymbol\tau.

4 Summary

  • All models maximise the same kind of Poisson log-normal variational bound; the optimizers differ in how they exploit its structure.
  • The builtin backend is a second-order method throughout: a joint (mean,variance)(\text{mean},\text{variance}) Newton VE-step everywhere, closed-form or graphical-lasso covariance M-steps, and β€” for PLNPCA β€” a profiled, saddle-aware trust-region Newton on the loadings.
  • nlopt remains the default for PLN/PLNPCA for its robustness; builtin is the default for the penalised and structured variants, where its second-order steps and better optima matter most. torch is experimental.