How to specify a model 🛠️

Published

March 27, 2025

1 Introduction

This tutorial demonstrates how to specify a model in pyPLNmodels. Two primary approaches are available:

R-style formulas (relying on the patsy package.)
direcly specifying arrays

The first approach is practical and more R-user-friendly, while the second gives more control over the input data.

We consider the more basic model of pyPLNmodels, namely the Pln model

1.1 Model

Given a count matrix \(Y\), the Pln model assumes the following (Aitchison and Ho 1989):

\[ Y_{ij}| Z_{ij} \sim \mathcal P(\exp(o_{ij} + Z_{ij})), \quad Z_{i}\sim \mathcal N(X_i^{\top} B, \Sigma),\] with input data

\(Y_{ij}\) (denoted endog in the package) is the \(j\)-th count for the \(i\)-th observation
\(X_i\) (denoted exog) covariates for the \(i\)-th observation (if available)
\(o_i\) (denoted offsets) offset for the \(i\)-th observation (if available)

and model parameters, inferred by the package:

\(B\) (denoted coef in the package) is a matrix of regression coefficients
\(\Sigma\) (denoted covariance in the package) is the covariance matrix of the latent variables \(Z_i\)

We review here how to create the design matrix \(X\) (i.e. create the exog data matrix).

2 Count Data

Let’s explore real count data from the microcosm dataset (Mariadassou et al. 2023), introduced in the NCBI BioProject PRJNA875059 and provided by the load_microcosm function in the package. This dataset features microbiota members sampled from various body sites of dairy cows throughout their lactation period. The data includes response count variables (endog) and explanatory variables:

Site information (site and site_1hot)
Time information (time and time_1hot)
Lineage information (lineage and lineage_1hot)

The 1_hot variables are one-hot encoded representations of the corresponding categorical variables.

2.1 Data importation

from pyPLNmodels import load_microcosm
micro = load_microcosm(dim = 20)
print('Data: ')
print(micro.keys())

Data:
dict_keys(['endog', 'site_1hot', 'site', 'lineage_1hot', 'lineage', 'time_1hot', 'time', 'affiliations'])

The counts (endog) are composed of \(400\) samples and \(20\) variables:

print('Endog shape: ', micro["endog"].shape)

Endog shape:  (400, 20)

On a log scale, here is the distribution of the counts:

import matplotlib.pyplot as plt
plt.hist(micro["endog"].values.ravel(), bins=100)
plt.yscale('log')
plt.show()

The counts are strongly zero-inflated here, but we do not make any zero-inflation hypothesis. To account for zero-inflation, please see the dedicated zero-inflation tutorial.

3 Model specifying

3.1 R-style formulas

The model can be initialized using R-style formulas, just as in the statsmodels package. Here is a quick overview

from pyPLNmodels import Pln

pln = Pln.from_formula("endog ~ 1 + site", data=micro)
print('exog:', pln.exog)
print('names:', pln.column_names_exog)

Setting the offsets to zero.
exog: tensor([[1., 0., 0., 0.],
        [1., 0., 0., 1.],
        [1., 0., 0., 1.],
        ...,
        [1., 0., 0., 1.],
        [1., 0., 0., 1.],
        [1., 1., 0., 0.]])
names: Index(['Intercept', 'site[T.L]', 'site[T.N]', 'site[T.V]'], dtype='object')

The formula specifies a model with an intercept (1) and the categorical covariate site. Internally, the formula is parsed using the patsy package, which handles categorical variables automatically by one-hot encoding them and excluding the reference level.

You can specify more complex models, such as including multiple covariates:

pln = Pln.from_formula("endog ~ 1 + site + time", data=micro)
print('exog:', pln.exog)
print('names:', pln.column_names_exog)

Setting the offsets to zero.
exog: tensor([[1., 0., 0.,  ..., 1., 0., 0.],
        [1., 0., 0.,  ..., 0., 1., 0.],
        [1., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [1., 0., 0.,  ..., 0., 0., 0.],
        [1., 0., 0.,  ..., 0., 0., 0.],
        [1., 1., 0.,  ..., 0., 1., 0.]])
names: Index(['Intercept', 'site[T.L]', 'site[T.N]', 'site[T.V]', 'time[T.1S]',
       'time[T.3M]', 'time[T.7M]'],
      dtype='object')

and interactions:

pln = Pln.from_formula("endog ~ 1 + site * time", data=micro)
print('exog:', pln.exog)
print('names:', pln.column_names_exog)

Setting the offsets to zero.
removing column site[T.L]:time[T.1S] as it is only zeros
exog: tensor([[1., 0., 0.,  ..., 0., 0., 0.],
        [1., 0., 0.,  ..., 0., 0., 0.],
        [1., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [1., 0., 0.,  ..., 0., 0., 0.],
        [1., 0., 0.,  ..., 0., 0., 0.],
        [1., 1., 0.,  ..., 0., 0., 0.]])
names: Index(['Intercept', 'site[T.L]', 'site[T.N]', 'site[T.V]', 'time[T.1S]',
       'time[T.3M]', 'time[T.7M]', 'site[T.L]:time[T.1S]',
       'site[T.N]:time[T.1S]', 'site[T.V]:time[T.1S]', 'site[T.L]:time[T.3M]',
       'site[T.N]:time[T.3M]', 'site[T.V]:time[T.3M]', 'site[T.L]:time[T.7M]',
       'site[T.N]:time[T.7M]', 'site[T.V]:time[T.7M]'],
      dtype='object')

Note that the left-hand side (endog) must always match the key for the count matrix in your data dictionary. By default, the offsets are set to zero unless you specify compute_offsets_method="logsum", in which case the log of the row sums of the counts is used:

pln = Pln.from_formula("endog ~ 1 + site", data=micro, compute_offsets_method="logsum")
print('offsets:', pln.offsets)

Setting the offsets as the log of the sum of `endog`.
offsets: tensor([[6.4281, 6.4281, 6.4281,  ..., 6.4281, 6.4281, 6.4281],
        [  -inf,   -inf,   -inf,  ...,   -inf,   -inf,   -inf],
        [4.7622, 4.7622, 4.7622,  ..., 4.7622, 4.7622, 4.7622],
        ...,
        [5.5053, 5.5053, 5.5053,  ..., 5.5053, 5.5053, 5.5053],
        [  -inf,   -inf,   -inf,  ...,   -inf,   -inf,   -inf],
        [5.3471, 5.3471, 5.3471,  ..., 5.3471, 5.3471, 5.3471]])

If the offsets key is in the data dictionary, then they are set to data["offsets"] automatically:

import numpy as np
tmp_micro = load_microcosm()
print("No offsets:", Pln.from_formula("endog ~ 1", data = tmp_micro).offsets)
tmp_micro["offsets"] = np.log(tmp_micro["endog"] + 2)
print("Dumb offsets:", Pln.from_formula("endog ~ 1", data = tmp_micro).offsets)

Setting the offsets to zero.
No offsets: tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])
Taking the offsets from the data given.
Dumb offsets: tensor([[0.6931, 0.6931, 0.6931,  ..., 0.6931, 0.6931, 0.6931],
        [0.6931, 0.6931, 0.6931,  ..., 0.6931, 0.6931, 0.6931],
        [0.6931, 0.6931, 0.6931,  ..., 5.3519, 0.6931, 4.5539],
        ...,
        [0.6931, 0.6931, 0.6931,  ..., 0.6931, 0.6931, 0.6931],
        [0.6931, 0.6931, 0.6931,  ..., 0.6931, 0.6931, 0.6931],
        [4.5850, 0.6931, 0.6931,  ..., 0.6931, 0.6931, 4.3307]])

3.2 Specifying arrays

Instead of using a formula, you can pass the arrays directly using the class constructor. This offers more flexibility, especially if your covariates are already processed or you want full control over preprocessing.

Here is how to specify a model with site as covariates:

import pandas as pd
pln = Pln(
    endog=micro["endog"],
    exog=micro["site_1hot"],
    add_const=False,
    compute_offsets_method="logsum"  # use log row sums as offsets
)

Setting the offsets as the log of the sum of `endog`.

The exog matrix should be full-rank, otherwise the initialization will raise an error. By default, the model automatically adds a constant column (intercept) to the covariates. However, in cases where this would result in a non-full rank covariate matrix, you can disable this behavior by setting add_const=False, especially if your data already includes an intercept or is one hot encoded, just like the above case.

If you do not wish to include covariates at all, simply omit the exog argument:

pln = Pln(
    endog=micro["endog"]
)

Setting the offsets to zero.

This corresponds to a model with latent structure but no fixed effects (except the intercept, which can be removed by setting add_const to False).

4 Model fitting

Once instantiate, one can simply calls .fit() to fit the model and infer the model parameters:

pln.fit()

Fitting a Pln model with full covariance.
Intializing parameters ...
Initialization finished.

Upper bound on the fitting time:   0%|          | 0/400 [00:00<?, ?it/s]Upper bound on the fitting time:  18%|█▊        | 74/400 [00:00<00:00, 731.52it/s]Upper bound on the fitting time:  40%|███▉      | 158/400 [00:00<00:00, 790.41it/s]Upper bound on the fitting time:  60%|█████▉    | 238/400 [00:00<00:00, 736.03it/s]Upper bound on the fitting time:  78%|███████▊  | 313/400 [00:00<00:00, 731.76it/s]Upper bound on the fitting time:  97%|█████████▋| 387/400 [00:00<00:00, 730.88it/s]Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 736.00it/s]

Maximum number of iterations (400)  reached in 1.4 seconds.
Last  criterion = 5.37e-06 . Required tolerance = 1e-06

A multivariate Pln with full covariance.
======================================================================
     Loglike   Dimension    Nb param         BIC         AIC         ICL
    -2658.12          20         230   3347.1434    2888.125   -61995.72

======================================================================
* Useful attributes
    .latent_variables .latent_positions .coef .covariance .precision .model_parameters .latent_parameters .optim_details
* Useful methods
    .transform() .show() .predict() .sigma() .projected_latent_variables() .plot_correlation_circle() .biplot() .viz() .pca_pairplot() .plot_expected_vs_true()
* Additional attributes for Pln are:
    None
* Additional methods for Pln are:
    .get_variance_coef() .get_confidence_interval_coef() .summary() .get_coef_p_values() .plot_regression_forest()

It is possible to fit the model directyly after specifying it:

pln = Pln(endog = micro["endog"]).fit()

Setting the offsets to zero.
Fitting a Pln model with full covariance.
Intializing parameters ...
Initialization finished.

Upper bound on the fitting time:   0%|          | 0/400 [00:00<?, ?it/s]Upper bound on the fitting time:  18%|█▊        | 73/400 [00:00<00:00, 722.86it/s]Upper bound on the fitting time:  36%|███▋      | 146/400 [00:00<00:00, 723.58it/s]Upper bound on the fitting time:  55%|█████▍    | 219/400 [00:00<00:00, 724.82it/s]Upper bound on the fitting time:  73%|███████▎  | 292/400 [00:00<00:00, 725.33it/s]Upper bound on the fitting time:  91%|█████████▏| 365/400 [00:00<00:00, 724.18it/s]Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 723.65it/s]

Maximum number of iterations (400)  reached in 0.6 seconds.
Last  criterion = 5.37e-06 . Required tolerance = 1e-06

See the basic analysis for more insights on the output of the model.

References

Aitchison, John, and CH Ho. 1989. “The Multivariate Poisson-Log Normal Distribution.” Biometrika 76 (4): 643–53.

Batardière, Bastien, Joon Kwon, and Julien Chiquet. 2024. “pyPLNmodels: A Python Package to Analyze Multivariate High-Dimensional Count Data.” Journal of Open Source Software.

Mariadassou, Mahendra, Laurent X Nouvel, Fabienne Constant, Diego P Morgavi, Lucie Rault, Sarah Barbey, Emmanuelle Helloin, et al. 2023. “Microbiota Members from Body Sites of Dairy Cows Are Largely Shared Within Individual Hosts Throughout Lactation but Sharing Is Limited in the Herd.” Animal Microbiome 5 (1): 1–17. https://doi.org/10.1186/s42523-023-00252-w.