This tutorial demonstrates how to specify a model in pyPLNmodels. Two primary approaches are available: R-style formulas and direct specification of the data matrices through the class constructor. The first approach is practical and more R-user-friendly, while the second gives more control over the input data.
We consider the most basic model of pyPLNmodels, namely the Pln model. Given a count matrix \(Y\), the Pln model assumes the following (Aitchison and Ho 1989):
\[ Y_{ij}| Z_{ij} \sim \mathcal P(\exp(o_{ij} + Z_{ij})), \quad Z_{i}\sim \mathcal N(X_i^{\top} B, \Sigma),\]
with input data:
- \(Y_{ij}\) (endog in the package) is the \(j\)-th count for the \(i\)-th observation
- \(X_i\) (exog) covariates for the \(i\)-th observation (if available)
- \(o_{ij}\) (offsets) offset for the \(i\)-th observation (if available)
and model parameters, inferred by the package:
- \(B\) (coef in the package) is a matrix of regression coefficients
- \(\Sigma\) (covariance in the package) is the covariance matrix of the latent variables \(Z_i\)
We review here how to create the design matrix \(X\) (i.e. create the exog data matrix).
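To make the notation concrete, here is a minimal simulation sketch of this generative model, written in plain NumPy and independent of the package (the names n, p, d and the parameter values are purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
n, p, d = 100, 5, 2                      # samples, count variables, covariates
X = rng.normal(size=(n, d))              # exog: design matrix
B = rng.normal(size=(d, p))              # coef: regression coefficients
Sigma = np.eye(p)                        # covariance of the latent variables
O = np.zeros((n, p))                     # offsets
Z = X @ B + rng.multivariate_normal(np.zeros(p), Sigma, size=n)  # Z_i ~ N(X_i^T B, Sigma)
Y = rng.poisson(np.exp(O + Z))           # endog: Y_ij | Z_ij ~ Poisson(exp(o_ij + Z_ij))
print(Y.shape)                           # (100, 5)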
Let's explore real count data from the microcosm dataset (Mariadassou et al. 2023), introduced in the NCBI BioProject and provided by the load_microcosm function in the package. This dataset features microbiota members sampled from various body sites of dairy cows throughout their lactation period. The data includes response count variables (endog) and explanatory variables:
- site (site and site_1hot)
- time (time and time_1hot)
- lineage (lineage and lineage_1hot)
The _1hot variables are one-hot encoded representations of the corresponding categorical variables.
from pyPLNmodels import load_microcosm

micro = load_microcosm(dim=20)
print('Data: ')
print(micro.keys())
Data:
dict_keys(['endog', 'site_1hot', 'site', 'lineage_1hot', 'lineage', 'time_1hot', 'time', 'affiliations'])
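The _1hot variables can be checked against their categorical counterparts. Below is a quick sketch; it assumes site is a one-dimensional categorical array and site_1hot a 0/1 matrix with one column per level (column names and ordering may differ from pandas' defaults):

import numpy as np
import pandas as pd

site = pd.Series(np.asarray(micro["site"]).ravel())
onehot = pd.get_dummies(site)            # reference one-hot encoding built with pandas
print("levels:", sorted(site.unique()))
print("shapes:", onehot.shape, np.asarray(micro["site_1hot"]).shape)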
The counts (endog) are composed of \(400\) samples and \(20\) variables:
print('Endog shape: ', micro["endog"].shape)
Endog shape: (400, 20)
On a log scale, here is the distribution of the counts:
import matplotlib.pyplot as plt

plt.hist(micro["endog"].values.ravel(), bins=100)
plt.yscale('log')
plt.show()
The counts are strongly zero-inflated here, but we do not make any zero-inflation hypothesis. To account for zero-inflation, please see the dedicated zero-inflation tutorial.
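To quantify this, one can compute the proportion of zero counts directly. A minimal sketch, assuming endog behaves like a pandas DataFrame or a two-dimensional array:

import numpy as np

endog = np.asarray(micro["endog"])
print("Proportion of zeros:", (endog == 0).mean())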
The model can be initialized using R-style formulas, just as in the statsmodels package. Here is a quick overview:
from pyPLNmodels import Pln

pln = Pln.from_formula("endog ~ 1 + site", data=micro)
print('exog:', pln.exog)
print('names:', pln.column_names_exog)
Setting the offsets to zero.
exog: tensor([[1., 0., 0., 0.],
[1., 0., 0., 1.],
[1., 0., 0., 1.],
...,
[1., 0., 0., 1.],
[1., 0., 0., 1.],
[1., 1., 0., 0.]])
names: Index(['Intercept', 'site[T.L]', 'site[T.N]', 'site[T.V]'], dtype='object')
The formula specifies a model with an intercept (1) and the categorical covariate site. Internally, the formula is parsed using the patsy package, which handles categorical variables automatically by one-hot encoding them and excluding the reference level.
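Since patsy does the parsing, the same design matrix can be reproduced outside the model for inspection. The following sketch assumes micro["site"] is a one-dimensional array or Series that patsy can treat as categorical:

import patsy

design = patsy.dmatrix("1 + site", micro, return_type="dataframe")
print(design.columns.tolist())  # expected: ['Intercept', 'site[T.L]', 'site[T.N]', 'site[T.V]']
print(design.shape)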
You can specify more complex models, such as including multiple covariates:
pln = Pln.from_formula("endog ~ 1 + site + time", data=micro)
print('exog:', pln.exog)
print('names:', pln.column_names_exog)
Setting the offsets to zero.
exog: tensor([[1., 0., 0., ..., 1., 0., 0.],
[1., 0., 0., ..., 0., 1., 0.],
[1., 0., 0., ..., 0., 0., 0.],
...,
[1., 0., 0., ..., 0., 0., 0.],
[1., 0., 0., ..., 0., 0., 0.],
[1., 1., 0., ..., 0., 1., 0.]])
names: Index(['Intercept', 'site[T.L]', 'site[T.N]', 'site[T.V]', 'time[T.1S]',
'time[T.3M]', 'time[T.7M]'],
dtype='object')
and interactions:
pln = Pln.from_formula("endog ~ 1 + site * time", data=micro)
print('exog:', pln.exog)
print('names:', pln.column_names_exog)
Setting the offsets to zero.
removing column site[T.L]:time[T.1S] as it is only zeros
exog: tensor([[1., 0., 0., ..., 0., 0., 0.],
[1., 0., 0., ..., 0., 0., 0.],
[1., 0., 0., ..., 0., 0., 0.],
...,
[1., 0., 0., ..., 0., 0., 0.],
[1., 0., 0., ..., 0., 0., 0.],
[1., 1., 0., ..., 0., 0., 0.]])
names: Index(['Intercept', 'site[T.L]', 'site[T.N]', 'site[T.V]', 'time[T.1S]',
'time[T.3M]', 'time[T.7M]', 'site[T.L]:time[T.1S]',
'site[T.N]:time[T.1S]', 'site[T.V]:time[T.1S]', 'site[T.L]:time[T.3M]',
'site[T.N]:time[T.3M]', 'site[T.V]:time[T.3M]', 'site[T.L]:time[T.7M]',
'site[T.N]:time[T.7M]', 'site[T.V]:time[T.7M]'],
dtype='object')
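The message about the removed interaction column arises because some site/time combinations are simply absent from the data, so the corresponding indicator is identically zero. A quick way to see which combinations occur, assuming site and time are one-dimensional categorical arrays:

import numpy as np
import pandas as pd

site = pd.Series(np.asarray(micro["site"]).ravel(), name="site")
time = pd.Series(np.asarray(micro["time"]).ravel(), name="time")
print(pd.crosstab(site, time))  # empty cells explain all-zero interaction columns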
Note that the left-hand side (endog) must always match the key for the count matrix in your data dictionary. The offset is computed automatically unless explicitly provided using the offsets key in the data dictionary. By default, it is set to zero unless you specify compute_offsets_method="logsum", in which case the log of the row sums of the counts is used.
import numpy as np

tmp_micro = load_microcosm()
print("No offsets:", Pln.from_formula("endog ~ 1", data=tmp_micro).offsets)
tmp_micro["offsets"] = np.log(tmp_micro["endog"] + 2)
print("Dumb offsets:", Pln.from_formula("endog ~ 1", data=tmp_micro).offsets)
Setting the offsets to zero.
No offsets: tensor([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]])
Taking the offsets from the data given.
Dumb offsets: tensor([[0.6931, 0.6931, 0.6931, ..., 0.6931, 0.6931, 0.6931],
[0.6931, 0.6931, 0.6931, ..., 0.6931, 0.6931, 0.6931],
[0.6931, 0.6931, 0.6931, ..., 5.3519, 0.6931, 4.5539],
...,
[0.6931, 0.6931, 0.6931, ..., 0.6931, 0.6931, 0.6931],
[0.6931, 0.6931, 0.6931, ..., 0.6931, 0.6931, 0.6931],
[4.5850, 0.6931, 0.6931, ..., 0.6931, 0.6931, 4.3307]])
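For completeness, here is a sketch of the logsum option, assuming from_formula accepts the same compute_offsets_method argument as the class constructor shown below. The comparison with manually computed log row sums is only indicative; the exact broadcasting of the offsets across columns is an assumption, not checked against the API:

import numpy as np

tmp_micro = load_microcosm()
pln_logsum = Pln.from_formula(
    "endog ~ 1", data=tmp_micro, compute_offsets_method="logsum"
)
manual = np.log(np.asarray(tmp_micro["endog"]).sum(axis=1))  # log of row sums
print("model offsets shape:", pln_logsum.offsets.shape)
print("log row sums (first 3):", manual[:3])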
Instead of using a formula, you can pass the arrays directly using the class constructor. This offers more flexibility, especially if your covariates are already processed or you want full control over preprocessing.
Here is how to specify a model with site as covariates:
import pandas as pd

pln = Pln(
    endog=micro["endog"],
    exog=micro["site_1hot"],
    add_const=False,
    compute_offsets_method="logsum",  # use log row sums as offsets
)
Setting the offsets as the log of the sum of `endog`.
The exog matrix should be full rank, otherwise the initialization will raise an error. By default, the model automatically adds a constant column (intercept) to the covariates. However, in cases where this would result in a non-full-rank covariate matrix, you can disable this behavior by setting add_const=False, especially if your data already includes an intercept or is one-hot encoded, as in the example above.
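To see why add_const=False is needed here, a quick rank check with plain NumPy (independent of the package) shows that stacking an intercept column onto a full one-hot encoding produces a rank-deficient matrix:

import numpy as np

onehot = np.asarray(micro["site_1hot"], dtype=float)
with_const = np.column_stack([np.ones(len(onehot)), onehot])
print("one-hot rank:", np.linalg.matrix_rank(onehot), "out of", onehot.shape[1], "columns")
print("with intercept:", np.linalg.matrix_rank(with_const), "out of", with_const.shape[1], "columns")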
If you do not wish to include covariates at all, simply omit the exog argument:
pln = Pln(endog=micro["endog"])
Setting the offsets to zero.
This corresponds to a model with latent structure but no fixed effects (except the intercept, which can be removed by setting add_const to False).
Once instantiated, one can simply call .fit() to fit the model and infer the model parameters:
pln.fit()
Fitting a Pln model with full covariance.
Intializing parameters ...
Initialization finished.
Maximum number of iterations (400) reached in 5.1 seconds.
Last criterion = 5.37e-06 . Required tolerance = 1e-06
A multivariate Pln with full covariance.
======================================================================
Loglike Dimension Nb param BIC AIC ICL
-2658.12 20 230 3347.1434 2888.125 -61972.2
======================================================================
* Useful attributes
.latent_variables .latent_positions .coef .covariance .precision .model_parameters .latent_parameters .optim_details
* Useful methods
.transform() .show() .predict() .sigma() .projected_latent_variables() .plot_correlation_circle() .biplot() .viz() .pca_pairplot() .plot_expected_vs_true()
* Additional attributes for Pln are:
None
* Additional methods for Pln are:
.get_variance_coef() .get_confidence_interval_coef() .summary() .get_coef_p_values() .plot_regression_forest()
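As a quick follow-up, a few of the attributes listed in the summary can be inspected directly on the fitted model. The attribute names come from the printout above; that they are array-like with a .shape attribute, and the exact shapes given in the comments, are assumptions:

print(pln.coef.shape)               # regression coefficients, here intercept only (expected (1, 20))
print(pln.covariance.shape)         # covariance of the latent variables (expected (20, 20))
print(pln.latent_variables.shape)   # latent layer (expected (400, 20))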
It is possible to fit the model directly after specifying it:
pln = Pln(endog=micro["endog"]).fit()
Setting the offsets to zero.
Fitting a Pln model with full covariance.
Intializing parameters ...
Initialization finished.
Maximum number of iterations (400) reached in 3.0 seconds.
Last criterion = 5.37e-06 . Required tolerance = 1e-06
See the basic analysis for more insights on the output of the model.