Basic analysis 🍼

Published

March 27, 2025

This guide introduces the basics of multivariate count data analysis using the pyPLNmodels package. For complex datasets, traditional models like Poisson regression and Zero-Inflated Poisson Regression (ZIP) may fail to capture important insights, such as correlations between counts (i.e. variables). The pyPLNmodels package provides two key models to address these challenges:

Pln model (documentation)
PlnPCA model (documentation)

The PlnPCA model extends the functionality of the Pln model to handle high-dimensional data, though it may slightly compromise parameter estimation accuracy.

1 Statistical Overview

Both models are built on the following assumptions for a given count matrix \(Y\) (Aitchison and Ho 1989):

\[ Y_{ij}| Z_{ij} \sim \mathcal P(\exp(o_{ij} + Z_{ij})), \quad Z_{i}\sim \mathcal N(X_i^{\top} B, \Sigma),\] where the input data includes:

\(Y_{ij}\) (endog): the \(j\)-th count for the \(i\)-th observation
\(X_i\) (exog): covariates for the \(i\)-th observation (if available)
\(o_i\) (offsets): offset for the \(i\)-th observation (if available)

and the model parameters are:

\(B\) (coef): a matrix of regression coefficients
\(\Sigma\) (covariance): the covariance matrix of the latent variables \(Z_i\)

These models aim to capture the structure of the data through the latent variables \(Z\).

The Pln model assumes \(\Sigma\) has full rank, while the PlnPCA model assumes \(\Sigma\) has a low rank, which must be specified by the user. A lower rank introduces a trade-off, reducing computational complexity but potentially compromising parameter estimation accuracy.

1.1 Purpose

The pyPLNmodels package is designed to:

Estimate the parameters \(B\) and \(\Sigma\)
Retrieve the latent variables \(Z\) (which typically contains more information than \(Y\))
Visualize the latent variables and their relationships

This is achieved using the input count matrix \(Y\), along with optional covariate matrix \(X\) (defaulting to a vector of 1s) and offsets \(O\) (defaulting to a matrix of 0s).

2 Importing Data

In this example, we analyze single-cell RNA-seq data provided by the load_scrna function in the package. Each column in the dataset represents a gene, while each row corresponds to a cell (i.e., an individual). Covariates for cell types (labels) are also included. For simplicity, we limit the analysis to \(10\) variables (dimensions).

from pyPLNmodels import load_scrna
rna = load_scrna(dim=10)
print('Data: ', rna.keys())

Returning scRNA dataset of size (400, 10)
Data:  dict_keys(['endog', 'labels', 'labels_1hot'])

2.1 Data Structure

2.1.1 Count Matrix (`endog`)

endog = rna["endog"]
print(endog.head())

     FTL  MALAT1   FTH1  TMSB4X   CD74  SPP1   APOE    B2M   ACTB  HLA-DRA
0    5.0    39.0   14.0    39.0   11.0   0.0    0.0   88.0   49.0      0.0
1    8.0    52.0   38.0   186.0   17.0   0.0    0.0  152.0  167.0      3.0
2   76.0    29.0   38.0    71.0   37.0   3.0    2.0   21.0    7.0     12.0
3  354.0   151.0  236.0   184.0  287.0   0.0  238.0  255.0   27.0    243.0
4    0.0    55.0   10.0    37.0    8.0   1.0    0.0   23.0   49.0      1.0

2.1.2 Cell Type

cell_type = rna["labels"]
print('Possible cell types: ', cell_type.unique())
print(cell_type.head())

Possible cell types:  ['T_cells_CD4+' 'Macrophages' 'T_cells_CD8+']
0    T_cells_CD4+
1    T_cells_CD4+
2     Macrophages
3     Macrophages
4    T_cells_CD4+
Name: standard_true_celltype_v5, dtype: object

3 Model Initialization and Fitting

To analyze the mean values for each cell type, we use cell type as a covariate, with Macrophages as the reference category:

from pyPLNmodels import Pln
pln = Pln.from_formula('endog ~ 1 + labels', data=rna).fit()

Setting the offsets to zero.
Fitting a Pln model with full covariance.
Intializing parameters ...
Initialization finished.

Upper bound on the fitting time:   0%|          | 0/400 [00:00<?, ?it/s]Upper bound on the fitting time:  21%|██        | 83/400 [00:00<00:00, 826.25it/s]Upper bound on the fitting time:  42%|████▏     | 168/400 [00:00<00:00, 837.12it/s]Upper bound on the fitting time:  65%|██████▍   | 259/400 [00:00<00:00, 869.92it/s]Upper bound on the fitting time:  90%|█████████ | 360/400 [00:00<00:00, 922.16it/s]Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 903.95it/s]

Maximum number of iterations (400)  reached in 1.3 seconds.
Last  criterion = 5.4e-06 . Required tolerance = 1e-06

For more details on formula syntax and model initialization, including handling offsets, refer to the dedicated tutorial.

3.1 Model Summary and Insights

After fitting the model, you can print its configuration and key details:

print(pln)

A multivariate Pln with full covariance.
======================================================================
     Loglike   Dimension    Nb param         BIC         AIC         ICL
   -12790.12          10          85  13044.7622   12875.125      6261.7

======================================================================
* Useful attributes
    .latent_variables .latent_positions .coef .covariance .precision .model_parameters .latent_parameters .optim_details
* Useful methods
    .transform() .show() .predict() .sigma() .projected_latent_variables() .plot_correlation_circle() .biplot() .viz() .pca_pairplot() .plot_expected_vs_true()
* Additional attributes for Pln are:
    None
* Additional methods for Pln are:
    .get_variance_coef() .get_confidence_interval_coef() .summary() .get_coef_p_values() .plot_regression_forest()

To gain deeper insights into the model parameters and the optimization process, use the .show() method:

pln.show()

Monitoring the norm of each parameter is essential to assess model convergence. If the model has not converged, consider refitting it with additional iterations and a reduced tolerance (tol). To adjust the number of iterations, use the maxiter parameter:

pln.fit(maxiter=1000, tol = 0).show()

Fitting a Pln model with full covariance.

Upper bound on the fitting time:   0%|          | 0/1000 [00:00<?, ?it/s]Upper bound on the fitting time:  11%|█         | 107/1000 [00:00<00:00, 1061.61it/s]Upper bound on the fitting time:  21%|██▏       | 214/1000 [00:00<00:00, 1056.11it/s]Upper bound on the fitting time:  32%|███▏      | 323/1000 [00:00<00:00, 1069.60it/s]Upper bound on the fitting time:  43%|████▎     | 431/1000 [00:00<00:00, 1073.67it/s]Upper bound on the fitting time:  54%|█████▍    | 539/1000 [00:00<00:00, 1075.09it/s]Upper bound on the fitting time:  65%|██████▍   | 647/1000 [00:00<00:00, 1055.11it/s]Upper bound on the fitting time:  75%|███████▌  | 753/1000 [00:00<00:00, 1041.14it/s]Upper bound on the fitting time:  86%|████████▌ | 858/1000 [00:00<00:00, 1022.65it/s]Upper bound on the fitting time:  96%|█████████▌| 961/1000 [00:00<00:00, 1024.86it/s]Upper bound on the fitting time: 100%|██████████| 1000/1000 [00:00<00:00, 1044.85it/s]

Maximum number of iterations (1400)  reached in 2.2 seconds.
Last  criterion = 0.0 . Required tolerance = 0

3.2 Exploring Latent Variables

The latent variables \(Z\), which capture the underlying structure of the data, are accessible via the latent_variables attribute, or the .transform() method:

Z = pln.latent_variables
Z = pln.transform()
print('Shape of Z:', Z.shape)

Shape of Z: torch.Size([400, 10])

You can visualize these latent variables using the .viz() method:

pln.viz(colors=cell_type)

By default the effect of covariates on the latent variables is included in the visualization. This means that the latent variables are represented as \(Z + XB\). The effect of covariates on the latent variables can be removed by using the remove_exog_effect keyword:

Z_moins_XB = pln.transform(remove_exog_effect=True)

To visualize the latent positions without the effect of covariates (i.e., (Z - XB)), set the remove_exog_effect parameter to True in the .viz() method:

pln.viz(colors=cell_type, remove_exog_effect=True)

Additionally, you can generate a pair plot of the first Principal Components (PCs) of the latent variables:

pln.pca_pairplot(n_components=4, colors=cell_type)

The remove_exog_effect parameter is also available in the pca_pairplot method.

4 Analyzing Covariate Effects

The model provides insights into the effects of covariates. For example, it may reveal that the mean value for Macrophages is higher compared to T_cells_CD4+ and T_cells_CD8+.

4.1 Confidence Intervals and Statistical Significance

To summarize the model, including confidence intervals and p-values, use the summary method:

pln.summary()

Coefficients and p-values per dimension:

Dimension: FTL
Exog Name                Coefficient         P-value
Intercept                   4.408599           1e-16
labels[T.T_cells_CD4+]       -3.255301           1e-16
labels[T.T_cells_CD8+]       -3.704965           1e-16

Dimension: MALAT1
Exog Name                Coefficient         P-value
Intercept                   4.496646           1e-16
labels[T.T_cells_CD4+]        0.044992            0.75
labels[T.T_cells_CD8+]       -0.602403         7.8e-08

Dimension: FTH1
Exog Name                Coefficient         P-value
Intercept                   3.989528           1e-16
labels[T.T_cells_CD4+]       -2.125104           1e-16
labels[T.T_cells_CD8+]       -3.118913           1e-16

Dimension: TMSB4X
Exog Name                Coefficient         P-value
Intercept                   4.338097           1e-16
labels[T.T_cells_CD4+]       -0.861660         6.8e-05
labels[T.T_cells_CD8+]       -1.246957           1e-16

Dimension: CD74
Exog Name                Coefficient         P-value
Intercept                   3.903079           1e-16
labels[T.T_cells_CD4+]       -3.282655           1e-16
labels[T.T_cells_CD8+]       -3.335481           1e-16

Dimension: SPP1
Exog Name                Coefficient         P-value
Intercept                   0.209591            0.19
labels[T.T_cells_CD4+]       -4.515602           1e-16
labels[T.T_cells_CD8+]       -5.166453           1e-16

Dimension: APOE
Exog Name                Coefficient         P-value
Intercept                   0.926648           2e-11
labels[T.T_cells_CD4+]       -4.293190         2.2e-16
labels[T.T_cells_CD8+]       -4.383797           1e-16

Dimension: B2M
Exog Name                Coefficient         P-value
Intercept                   3.912995           1e-16
labels[T.T_cells_CD4+]       -0.220007            0.67
labels[T.T_cells_CD8+]       -0.583185         0.00023

Dimension: ACTB
Exog Name                Coefficient         P-value
Intercept                   3.496265           1e-16
labels[T.T_cells_CD4+]       -1.082953         0.00027
labels[T.T_cells_CD8+]       -1.466583         3.5e-09

Dimension: HLA-DRA
Exog Name                Coefficient         P-value
Intercept                   3.575849           1e-16
labels[T.T_cells_CD4+]       -4.725000           1e-16
labels[T.T_cells_CD8+]       -4.348214           1e-16

The p-value corresponds to the coding used in one-hot encoding, with Macrophages set as the reference category.

You can also visualize confidence intervals for regression coefficients using the plot_regression_forest method:

pln.plot_regression_forest()

5 Covariance and Variable Visualization

The pyPLNmodels package provides tools to analyze the effects of variables and visualize their relationships using the plot_correlation_circle() and biplot() methods.

5.1 Correlation Circle

Use the plot_correlation_circle() method to visualize the relationships between variables:

pln.plot_correlation_circle(column_names=["FTL", "MALAT1", "FTH1"])

You can access the column names using the column_names_endog attribute of the model.

5.2 Biplot

The biplot() method allows simultaneous visualization of variables and latent variables ((Z)):

pln.biplot(column_names=["FTL", "MALAT1", "FTH1"], colors=cell_type)

6 High-dimension: `PlnPCA` Model

The PlnPCA model (Chiquet, Mariadassou, and Robin 2018) extends the functionality of the Pln model for high-dimensional data. It uses the same syntax as Pln:

from pyPLNmodels import PlnPCA
high_d_rna = load_scrna(dim=500)
pca = PlnPCA.from_formula('endog ~ 1 + labels', data=high_d_rna, rank=5).fit()

Returning scRNA dataset of size (400, 500)
Setting the offsets to zero.
Fitting a PlnPCA model with 5 principal components.
Intializing parameters ...
Initialization finished.

Upper bound on the fitting time:   0%|          | 0/400 [00:00<?, ?it/s]Upper bound on the fitting time:  12%|█▎        | 50/400 [00:00<00:00, 497.67it/s]Upper bound on the fitting time:  25%|██▌       | 100/400 [00:00<00:00, 480.72it/s]Upper bound on the fitting time:  37%|███▋      | 149/400 [00:00<00:00, 465.31it/s]Upper bound on the fitting time:  50%|████▉     | 198/400 [00:00<00:00, 472.47it/s]Upper bound on the fitting time:  62%|██████▎   | 250/400 [00:00<00:00, 487.23it/s]Upper bound on the fitting time:  76%|███████▌  | 302/400 [00:00<00:00, 497.51it/s]Upper bound on the fitting time:  89%|████████▉ | 355/400 [00:00<00:00, 505.50it/s]Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 495.90it/s]

Maximum number of iterations (400)  reached in 1.0 seconds.
Last  criterion = 5.4e-06 . Required tolerance = 1e-06

⚠️ Note: P-values are not available in the PlnPCA model.

print(pca)

A multivariate PlnPCA with 5 principal components.
======================================================================
     Loglike   Dimension    Nb param         BIC         AIC         ICL
  -330657.75         500        3990 342610.7218   334647.75   339757.06

======================================================================
* Useful attributes
    .latent_variables .latent_positions .coef .covariance .precision .model_parameters .latent_parameters .optim_details
* Useful methods
    .transform() .show() .predict() .sigma() .projected_latent_variables() .plot_correlation_circle() .biplot() .viz() .pca_pairplot() .plot_expected_vs_true()
* Additional attributes for PlnPCA are:
    None
* Additional methods for PlnPCA are:
    None

A low-dimensional of dimension rank of the latent variables can be obtained using the project keyword of the .transform() method:

Z_low_dim = pca.transform(project=True)
print('Shape of Z_low_dim:', Z_low_dim.shape)

Shape of Z_low_dim: torch.Size([400, 5])

This model is particularly efficient for high-dimensional datasets, offering significantly reduced computation time compared to Pln. See this paper for a computational comparison between Pln and PlnPCA

6.1 Selecting the Rank

The rank is a hyperparameter that can be specified by the user. Alternatively, a data-driven approach to rank selection can be performed using the PlnPCACollection class, which fits multiple models with different ranks:

from pyPLNmodels import PlnPCACollection
pcas = PlnPCACollection.from_formula('endog ~ 1 + labels', data=high_d_rna, ranks=[3, 5, 10, 15]).fit()

Setting the offsets to zero.
Adjusting 4 PlnPCA models.

Fitting a PlnPCA model with 3 principal components.
Intializing parameters ...
Initialization finished.

Upper bound on the fitting time:   0%|          | 0/400 [00:00<?, ?it/s]Upper bound on the fitting time:  13%|█▎        | 51/400 [00:00<00:00, 505.80it/s]Upper bound on the fitting time:  26%|██▋       | 106/400 [00:00<00:00, 529.37it/s]Upper bound on the fitting time:  40%|████      | 161/400 [00:00<00:00, 537.43it/s]Upper bound on the fitting time:  54%|█████▍    | 216/400 [00:00<00:00, 541.60it/s]Upper bound on the fitting time:  68%|██████▊   | 272/400 [00:00<00:00, 547.74it/s]Upper bound on the fitting time:  82%|████████▏ | 328/400 [00:00<00:00, 550.41it/s]Upper bound on the fitting time:  96%|█████████▌| 384/400 [00:00<00:00, 549.51it/s]Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 543.79it/s]

Maximum number of iterations (400)  reached in 0.9 seconds.
Last  criterion = 5.4e-06 . Required tolerance = 1e-06
Fitting a PlnPCA model with 5 principal components.
Intializing parameters ...
Initialization finished.

Upper bound on the fitting time:   0%|          | 0/400 [00:00<?, ?it/s]Upper bound on the fitting time:  13%|█▎        | 52/400 [00:00<00:00, 517.32it/s]Upper bound on the fitting time:  26%|██▌       | 104/400 [00:00<00:00, 503.69it/s]Upper bound on the fitting time:  39%|███▉      | 155/400 [00:00<00:00, 493.49it/s]Upper bound on the fitting time:  51%|█████▏    | 205/400 [00:00<00:00, 467.65it/s]Upper bound on the fitting time:  63%|██████▎   | 252/400 [00:00<00:00, 462.31it/s]Upper bound on the fitting time:  75%|███████▍  | 299/400 [00:00<00:00, 463.64it/s]Upper bound on the fitting time:  87%|████████▋ | 348/400 [00:00<00:00, 471.12it/s]Upper bound on the fitting time: 100%|█████████▉| 399/400 [00:00<00:00, 483.13it/s]Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 478.59it/s]

Maximum number of iterations (400)  reached in 0.8 seconds.
Last  criterion = 5.4e-06 . Required tolerance = 1e-06
Fitting a PlnPCA model with 10 principal components.
Intializing parameters ...
Initialization finished.

Upper bound on the fitting time:   0%|          | 0/400 [00:00<?, ?it/s]Upper bound on the fitting time:  12%|█▏        | 48/400 [00:00<00:00, 478.78it/s]Upper bound on the fitting time:  24%|██▍       | 96/400 [00:00<00:00, 476.02it/s]Upper bound on the fitting time:  36%|███▌      | 144/400 [00:00<00:00, 472.61it/s]Upper bound on the fitting time:  48%|████▊     | 192/400 [00:00<00:00, 472.11it/s]Upper bound on the fitting time:  60%|██████    | 240/400 [00:00<00:00, 473.42it/s]Upper bound on the fitting time:  72%|███████▏  | 288/400 [00:00<00:00, 473.25it/s]Upper bound on the fitting time:  84%|████████▍ | 336/400 [00:00<00:00, 469.16it/s]Upper bound on the fitting time:  96%|█████████▌| 384/400 [00:00<00:00, 472.00it/s]Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 471.98it/s]

Maximum number of iterations (400)  reached in 0.8 seconds.
Last  criterion = 5.4e-06 . Required tolerance = 1e-06
Fitting a PlnPCA model with 15 principal components.
Intializing parameters ...
Initialization finished.

Upper bound on the fitting time:   0%|          | 0/400 [00:00<?, ?it/s]Upper bound on the fitting time:  12%|█▎        | 50/400 [00:00<00:00, 495.53it/s]Upper bound on the fitting time:  25%|██▌       | 100/400 [00:00<00:00, 493.94it/s]Upper bound on the fitting time:  38%|███▊      | 150/400 [00:00<00:00, 494.76it/s]Upper bound on the fitting time:  50%|█████     | 201/400 [00:00<00:00, 498.12it/s]Upper bound on the fitting time:  63%|██████▎   | 252/400 [00:00<00:00, 500.57it/s]Upper bound on the fitting time:  76%|███████▌  | 303/400 [00:00<00:00, 500.40it/s]Upper bound on the fitting time:  88%|████████▊ | 354/400 [00:00<00:00, 480.49it/s]Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 487.86it/s]

Maximum number of iterations (400)  reached in 0.8 seconds.
Last  criterion = 5.4e-06 . Required tolerance = 1e-06
======================================================================

DONE!
Best model (lower BIC): rank 15
    Best model(lower AIC): rank  15

======================================================================

Use the .show() method to explore insights about the optimal rank:

pcas.show()

⚠️ Note: The best model typically correspond to the largest rank, which may not always be desirable.

The best model can be accessed using the .best_model() method. The selection criterion can be AIC, BIC, or ICL:

best_model = pcas.best_model(criterion="BIC")
print(best_model)

A multivariate PlnPCA with 15 principal components.
======================================================================
     Loglike   Dimension    Nb param         BIC         AIC         ICL
   -281405.0         500        8895 308052.0386    290300.0   299389.87

======================================================================
* Useful attributes
    .latent_variables .latent_positions .coef .covariance .precision .model_parameters .latent_parameters .optim_details
* Useful methods
    .transform() .show() .predict() .sigma() .projected_latent_variables() .plot_correlation_circle() .biplot() .viz() .pca_pairplot() .plot_expected_vs_true()
* Additional attributes for PlnPCA are:
    None
* Additional methods for PlnPCA are:
    None

6.2 Accessing Individual Models

All individual models in the collection can be accessed with the rank as the key:

pca_5 = pcas[5]
print(pca_5)

A multivariate PlnPCA with 5 principal components.
======================================================================
     Loglike   Dimension    Nb param         BIC         AIC         ICL
  -330615.38         500        3990 342568.3468  334605.375   339716.13

======================================================================
* Useful attributes
    .latent_variables .latent_positions .coef .covariance .precision .model_parameters .latent_parameters .optim_details
* Useful methods
    .transform() .show() .predict() .sigma() .projected_latent_variables() .plot_correlation_circle() .biplot() .viz() .pca_pairplot() .plot_expected_vs_true()
* Additional attributes for PlnPCA are:
    None
* Additional methods for PlnPCA are:
    None

You can also iterate through the collection:

for pca in pcas.values():
    print(pca)

A multivariate PlnPCA with 3 principal components.
======================================================================
     Loglike   Dimension    Nb param         BIC         AIC         ICL
  -363016.88         500        2997 371995.0846  366013.875   370287.37

======================================================================
* Useful attributes
    .latent_variables .latent_positions .coef .covariance .precision .model_parameters .latent_parameters .optim_details
* Useful methods
    .transform() .show() .predict() .sigma() .projected_latent_variables() .plot_correlation_circle() .biplot() .viz() .pca_pairplot() .plot_expected_vs_true()
* Additional attributes for PlnPCA are:
    None
* Additional methods for PlnPCA are:
    None
A multivariate PlnPCA with 5 principal components.
======================================================================
     Loglike   Dimension    Nb param         BIC         AIC         ICL
  -330615.38         500        3990 342568.3468  334605.375   339716.13

======================================================================
* Useful attributes
    .latent_variables .latent_positions .coef .covariance .precision .model_parameters .latent_parameters .optim_details
* Useful methods
    .transform() .show() .predict() .sigma() .projected_latent_variables() .plot_correlation_circle() .biplot() .viz() .pca_pairplot() .plot_expected_vs_true()
* Additional attributes for PlnPCA are:
    None
* Additional methods for PlnPCA are:
    None
A multivariate PlnPCA with 10 principal components.
======================================================================
     Loglike   Dimension    Nb param         BIC         AIC         ICL
   -295788.5         500        6455 315125.9518    302243.5   309395.58

======================================================================
* Useful attributes
    .latent_variables .latent_positions .coef .covariance .precision .model_parameters .latent_parameters .optim_details
* Useful methods
    .transform() .show() .predict() .sigma() .projected_latent_variables() .plot_correlation_circle() .biplot() .viz() .pca_pairplot() .plot_expected_vs_true()
* Additional attributes for PlnPCA are:
    None
* Additional methods for PlnPCA are:
    None
A multivariate PlnPCA with 15 principal components.
======================================================================
     Loglike   Dimension    Nb param         BIC         AIC         ICL
   -281405.0         500        8895 308052.0386    290300.0   299389.87

======================================================================
* Useful attributes
    .latent_variables .latent_positions .coef .covariance .precision .model_parameters .latent_parameters .optim_details
* Useful methods
    .transform() .show() .predict() .sigma() .projected_latent_variables() .plot_correlation_circle() .biplot() .viz() .pca_pairplot() .plot_expected_vs_true()
* Additional attributes for PlnPCA are:
    None
* Additional methods for PlnPCA are:
    None

6.3 Additional Information

For further details, print the model to explore its methods and attributes.

References

Aitchison, John, and CH Ho. 1989. “The Multivariate Poisson-Log Normal Distribution.” Biometrika 76 (4): 643–53.

Batardière, Bastien, Joon Kwon, and Julien Chiquet. 2024. “pyPLNmodels: A Python Package to Analyze Multivariate High-Dimensional Count Data.” Journal of Open Source Software.

Chiquet, J, M Mariadassou, and S Robin. 2018. “Variational Inference for Probabilistic Poisson PCA.” The Annals of Applied Statistics.