Clustering of count data 🔍

Published

March 27, 2025

1 Introduction

The Poisson lognormal (PLN) model can be extended to uncover clustering structures in count data. Two main approaches are available:

Supervised clustering: when cluster labels are known for some samples (PlnLDA, documentation).
Unsupervised clustering: when cluster labels are unknown (PlnMixture, documentation).

We illustrate both using a single-cell RNA-seq dataset provided by the load_scrna function of the package. Each row represents a cell and each column a gene. Cell types are included as labels and used in the supervised setting.

2 Data Loading and Structure

from pyPLNmodels import load_scrna
rna = load_scrna(dim=20)
print("Data keys:", rna.keys())

Returning scRNA dataset of size (400, 20)
Data keys: dict_keys(['endog', 'labels', 'labels_1hot'])

2.1 Count Matrix (`endog`)

endog = rna["endog"]
print(endog.head())

     FTL  MALAT1   FTH1  TMSB4X   CD74  SPP1   APOE    B2M   ACTB  HLA-DRA  \
0    5.0    39.0   14.0    39.0   11.0   0.0    0.0   88.0   49.0      0.0
1    8.0    52.0   38.0   186.0   17.0   0.0    0.0  152.0  167.0      3.0
2   76.0    29.0   38.0    71.0   37.0   3.0    2.0   21.0    7.0     12.0
3  354.0   151.0  236.0   184.0  287.0   0.0  238.0  255.0   27.0    243.0
4    0.0    55.0   10.0    37.0    8.0   1.0    0.0   23.0   49.0      1.0

   APOC1  MT-CO1  TMSB10  EEF1A1  HLA-DRB1  HLA-DPA1  HLA-DPB1  MT-CO2  LYZ  \
0    0.0    29.0    22.0    34.0       0.0       0.0       1.0    21.0  0.0
1    0.0   109.0    65.0   128.0       2.0       4.0       1.0   172.0  1.0
2    3.0    46.0     7.0    11.0       0.0       7.0      17.0    16.0  0.0
3  196.0   118.0    45.0    98.0      66.0     115.0     160.0    27.0  6.0
4    0.0    11.0    25.0    10.0       1.0       2.0       2.0     6.0  0.0

   RPLP1
0   14.0
1  124.0
2   20.0
3   40.0
4   18.0

2.2 Cell Types (`labels`)

cell_type = rna["labels"]
print("Cell types:", cell_type.unique())
print(cell_type.head())
print("Sample counts:", cell_type.value_counts())

Cell types: ['T_cells_CD4+' 'Macrophages' 'T_cells_CD8+']
0    T_cells_CD4+
1    T_cells_CD4+
2     Macrophages
3     Macrophages
4    T_cells_CD4+
Name: standard_true_celltype_v5, dtype: object
Sample counts: standard_true_celltype_v5
T_cells_CD8+    162
T_cells_CD4+    139
Macrophages      99
Name: count, dtype: int64

3 Supervised Clustering with `PlnLDA`

This method performs a Poisson lognormal discriminant analysis (PLN-LDA), similar to classical LDA but adapted to count data via a latent Gaussian layer.

3.1 Model Formulation

Let \(c_i\) denote the known cluster label for sample \(i\):

\[ \begin{align} Z_i| c_i = k &\sim \mathcal{N}(X_i^\top B + \mu_{k}, \Sigma), \\ Y_{ij} &\sim \mathcal{P}(\exp(o_{ij} + Z_{ij})) \end{align} \]

3.2 Train-Test Split

We separate the data into training and test sets. The training set is used to fit the model, while the test set is used for evaluation.

train_prop = 0.8
n_train = int(endog.shape[0] * train_prop)
endog_train, endog_test = endog[:n_train], endog[n_train:]
cell_type_train, cell_type_test = cell_type[:n_train], cell_type[n_train:]

3.3 Fitting the PLN-LDA Model

from pyPLNmodels import PlnLDA
lda = PlnLDA(endog_train, clusters=cell_type_train).fit()
print(lda)

Setting the offsets to zero.
Fitting a PlnLDA model with full covariance.
Intializing parameters ...
Initialization finished.

Upper bound on the fitting time:   0%|          | 0/400 [00:00<?, ?it/s]Upper bound on the fitting time:  26%|██▌       | 102/400 [00:00<00:00, 1017.58it/s]Upper bound on the fitting time:  51%|█████     | 204/400 [00:00<00:00, 999.29it/s] Upper bound on the fitting time:  76%|███████▋  | 306/400 [00:00<00:00, 1006.35it/s]Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 1009.37it/s]

Maximum number of iterations (400)  reached in 1.2 seconds.
Last  criterion = 5.4e-06 . Required tolerance = 1e-06
A multivariate PlnLDA with full covariance.
======================================================================
     Loglike   Dimension    Nb param         BIC         AIC         ICL
   -17914.56          20         210  18520.2362  18124.5625      8088.1

======================================================================
* Useful attributes
    .latent_variables .latent_positions .coef .covariance .precision .model_parameters .latent_parameters .optim_details
* Useful methods
    .transform() .show() .predict() .sigma() .projected_latent_variables() .plot_correlation_circle() .biplot() .viz() .pca_pairplot() .plot_expected_vs_true()
* Additional attributes for PlnLDA are:
    .clusters .latent_positions_clusters
* Additional methods for PlnLDA are:
    .predict_clusters() .transform_new() .viz_transformed()

You can also define the model using R-style formulas as follows:

data_train = {"endog_train": endog_train, "clusters_train": cell_type_train}
_ = PlnLDA.from_formula("endog_train ~ 0 | clusters_train", data=data_train)

Setting the offsets to zero.

Use a pipe | to separate the exogenous variables from the clusters. See the dedicated tutorial for more details on how to specify a model.

3.4 Predicting on Test Set

pred_test = lda.predict_clusters(endog_test)

Setting the offsets to zero.
Doing a VE step.

Upper bound on the fitting time:   0%|          | 0/50 [00:00<?, ?it/s]Upper bound on the fitting time: 100%|██████████| 50/50 [00:00<00:00, 1410.83it/s]

Done!
Setting the offsets to zero.
Doing a VE step.

Upper bound on the fitting time:   0%|          | 0/50 [00:00<?, ?it/s]Upper bound on the fitting time: 100%|██████████| 50/50 [00:00<00:00, 1442.06it/s]

Done!
Setting the offsets to zero.
Doing a VE step.

Upper bound on the fitting time:   0%|          | 0/50 [00:00<?, ?it/s]Upper bound on the fitting time: 100%|██████████| 50/50 [00:00<00:00, 1452.76it/s]

Done!

from pyPLNmodels import plot_confusion_matrix
plot_confusion_matrix(pred_test, cell_type_test)

You can enhance performance by incorporating additional variables or increasing the number of samples in your dataset.

3.5 Visualizing the Latent Space

The .transform_new() method transforms unseen endogenous data into the latent space via LDA.

lda.viz_transformed(lda.transform_new(endog_test), colors=cell_type_test)

Setting the offsets to zero.
Doing a VE step.

Upper bound on the fitting time:   0%|          | 0/50 [00:00<?, ?it/s]Upper bound on the fitting time: 100%|██████████| 50/50 [00:00<00:00, 1217.64it/s]

Done!
Setting the offsets to zero.
Doing a VE step.

Upper bound on the fitting time:   0%|          | 0/50 [00:00<?, ?it/s]Upper bound on the fitting time: 100%|██████████| 50/50 [00:00<00:00, 1304.64it/s]

Done!
Setting the offsets to zero.
Doing a VE step.

Upper bound on the fitting time:   0%|          | 0/50 [00:00<?, ?it/s]Upper bound on the fitting time: 100%|██████████| 50/50 [00:00<00:00, 1414.64it/s]

Done!

lda.viz_transformed(lda.transform_new(endog_train), colors=cell_type_train)

Setting the offsets to zero.
Doing a VE step.

Upper bound on the fitting time:   0%|          | 0/50 [00:00<?, ?it/s]Upper bound on the fitting time: 100%|██████████| 50/50 [00:00<00:00, 1143.74it/s]

Done!
Setting the offsets to zero.
Doing a VE step.

Upper bound on the fitting time:   0%|          | 0/50 [00:00<?, ?it/s]Upper bound on the fitting time: 100%|██████████| 50/50 [00:00<00:00, 1146.37it/s]

Done!
Setting the offsets to zero.
Doing a VE step.

Upper bound on the fitting time:   0%|          | 0/50 [00:00<?, ?it/s]Upper bound on the fitting time: 100%|██████████| 50/50 [00:00<00:00, 1147.84it/s]

Done!

3.5.1 Optional: Exogenous Variables

You can include exogenous variables in the model; however, they will not be used for making predictions. When predicting on new data, ensure that the corresponding covariates for the samples are provided.

import torch
dumb_lda = PlnLDA(endog_train, exog=torch.randn(endog_train.shape[0], 2), clusters=cell_type_train).fit()
_ = dumb_lda.predict_clusters(endog_test, exog=torch.randn(endog_test.shape[0], 2))

Setting the offsets to zero.
Fitting a PlnLDA model with full covariance.
Intializing parameters ...
Initialization finished.

Upper bound on the fitting time:   0%|          | 0/400 [00:00<?, ?it/s]Upper bound on the fitting time:  25%|██▍       | 99/400 [00:00<00:00, 982.59it/s]Upper bound on the fitting time:  50%|█████     | 200/400 [00:00<00:00, 995.39it/s]Upper bound on the fitting time:  75%|███████▌  | 301/400 [00:00<00:00, 998.88it/s]Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 995.52it/s]

Maximum number of iterations (400)  reached in 0.4 seconds.
Last  criterion = 5.4e-06 . Required tolerance = 1e-06
Setting the offsets to zero.
Doing a VE step.

Upper bound on the fitting time:   0%|          | 0/50 [00:00<?, ?it/s]Upper bound on the fitting time: 100%|██████████| 50/50 [00:00<00:00, 1253.41it/s]

Done!
Setting the offsets to zero.
Doing a VE step.

Upper bound on the fitting time:   0%|          | 0/50 [00:00<?, ?it/s]Upper bound on the fitting time: 100%|██████████| 50/50 [00:00<00:00, 1243.13it/s]

Done!
Setting the offsets to zero.
Doing a VE step.

Upper bound on the fitting time:   0%|          | 0/50 [00:00<?, ?it/s]Upper bound on the fitting time: 100%|██████████| 50/50 [00:00<00:00, 1254.84it/s]

Done!

⚠️ Exogenous variables must be full rank, and an intercept will be added to account for the cluster bias.. Avoid one-hot encodings without intercept removal, like this:

try:
    wrong_lda = PlnLDA.from_formula("endog ~ 1 | labels", data = rna)
except ValueError as e:
    print(str(e))
try:
    wrong_lda = PlnLDA.from_formula("endog ~ labels | labels", data = rna)
except ValueError as e:
    print(str(e))

Setting the offsets to zero.
Input matrix (exog,clusters) does not result in (exog,clusters).T @(exog,clusters) being full rank (rank = 3, expected = 4). You may consider removing one or more variables or set add_const to False if that is not already the case. You can also set 0 + exog | clusters  in the formula to avoid adding an intercept.
Setting the offsets to zero.
Input matrix (exog,clusters) does not result in (exog,clusters).T @(exog,clusters) being full rank (rank = 3, expected = 6). You may consider removing one or more variables or set add_const to False if that is not already the case. You can also set 0 + exog | clusters  in the formula to avoid adding an intercept.

4 Unsupervised Clustering with `PlnMixture`

When no labels are known, PlnMixture fits a latent mixture model to identify subpopulations in an unsupervised way.

4.1 Model Formulation

\[ \begin{align} c_i &\sim \mathcal{M}(1, \pi), \\ Z_i \mid c_i=k &\sim \mathcal{N}(X_i^\top B + \mu_k, \Sigma_k), \\ Y_{ij} &\sim \mathcal{P}(\exp(o_{ij} + Z_{ij})) \end{align} \]

Covariance matrices \(\Sigma_k\) are assumed diagonal.

4.2 Model Fitting

from pyPLNmodels import PlnMixture
mixture = PlnMixture(endog, n_cluster=3).fit()
print(mixture)

Setting the offsets to zero.
Fitting a PlnMixture model with diagonal covariances and 3 clusters.
Intialization ...
Fitting a PlnDiag model with full covariance.
Intializing parameters ...
Initialization finished.

Upper bound on the fitting time:   0%|          | 0/150 [00:00<?, ?it/s]Upper bound on the fitting time:  72%|███████▏  | 108/150 [00:00<00:00, 1072.36it/s]Upper bound on the fitting time: 100%|██████████| 150/150 [00:00<00:00, 1072.58it/s]

Maximum number of iterations (150)  reached in 0.1 seconds.
Last  criterion = 0.01074308 . Required tolerance = 1e-06

Finished!

Upper bound on the fitting time:   0%|          | 0/400 [00:00<?, ?it/s]Upper bound on the fitting time:  12%|█▏        | 49/400 [00:00<00:00, 481.98it/s]Upper bound on the fitting time:  24%|██▍       | 98/400 [00:00<00:00, 471.17it/s]Upper bound on the fitting time:  36%|███▋      | 146/400 [00:00<00:00, 468.95it/s]Upper bound on the fitting time:  48%|████▊     | 194/400 [00:00<00:00, 469.85it/s]Upper bound on the fitting time:  60%|██████    | 241/400 [00:00<00:00, 284.87it/s]Upper bound on the fitting time:  73%|███████▎  | 293/400 [00:00<00:00, 339.64it/s]Upper bound on the fitting time:  86%|████████▌ | 342/400 [00:00<00:00, 377.22it/s]Upper bound on the fitting time:  98%|█████████▊| 390/400 [00:00<00:00, 402.95it/s]Upper bound on the fitting time: 100%|██████████| 400/400 [00:01<00:00, 390.06it/s]

Maximum number of iterations (400)  reached in 1.2 seconds.
Last  criterion = 5.37e-06 . Required tolerance = 1e-06
A multivariate PlnMixture with diagonal covariances and 3 clusters.
======================================================================
     Loglike   Dimension    Nb param         BIC         AIC         ICL  Silhouette
   -24320.87          20         120  24680.3536  24440.8657    11699.65  0.42457986

======================================================================
* Useful attributes
    .latent_variables .latent_positions .coef .covariance .precision .model_parameters .latent_parameters .optim_details
* Useful methods
    .transform() .show() .predict() .sigma() .projected_latent_variables() .plot_correlation_circle() .biplot() .viz() .pca_pairplot() .plot_expected_vs_true()
* Additional attributes for PlnMixture are:
    .covariances .cluster_bias .latent_prob .n_cluster .weights
* Additional methods for PlnMixture are:
    .predict_clusters()

4.2.1 Optional: With Covariates

_ = PlnMixture(endog, exog=torch.randn(endog.shape[0], 2), n_cluster=3)

Setting the offsets to zero.

⚠️ Exogenous variables must be full rank, and an intercept will be added to account for the cluster bias. Avoid one-hot encodings without intercept removal:

try:
    wrong_mixt = PlnMixture.from_formula("endog ~ 1", data = rna, n_cluster = 3)
except ValueError as e:
    print(str(e))
try:
    wrong_mixt = PlnMixture.from_formula("endog ~ labels", data = rna, n_cluster = 3)
except ValueError as e:
    print(str(e))
try:
    wrong_mixt = PlnMixture(endog = rna["endog"], exog = rna["labels_1hot"], n_cluster = 3)
except ValueError as e:
    print(str(e))
right_mixt = PlnMixture.from_formula("endog ~ 0", data = rna, n_cluster = 3)
right_mixt = PlnMixture(endog = rna["endog"], exog = rna["labels_1hot"].iloc[:,1:], n_cluster = 3)

Setting the offsets to zero.
Input matrix exog does not result in (exog,1).T @(exog,1) being full rank (rank = 1, expected = 2). This gives non identifiable cluster means. You may consider removing one or more variables in exog.
Setting the offsets to zero.
Input matrix exog does not result in (exog,1).T @(exog,1) being full rank (rank = 3, expected = 4). This gives non identifiable cluster means. You may consider removing one or more variables in exog.
Setting the offsets to zero.
Input matrix exog does not result in (exog,1).T @(exog,1) being full rank (rank = 3, expected = 4). This gives non identifiable cluster means. You may consider removing one or more variables in exog.
Setting the offsets to zero.
Setting the offsets to zero.

4.2.2 Extracting Weights

print(mixture.weights)

tensor([0.5004, 0.2524, 0.2471], dtype=torch.float64)

4.3 Visualizing and Evaluating Clusters

One can access the clusters of the samples using the .clusters attribute:

clusters = mixture.clusters

Some useful visualization and information about clusters are displayed using the .show() method:

mixture.show()

One can visualize the latent variables using the .viz() method. By default, the latent variables are colored by the inferred clusters, but other colors can be specified using the colors argument.

mixture.viz()

You can compare the inferred clusters with the actual cell types, although retrieving the exact cell types is not expected since the method is unsupervised.

To visualize the clustering results, use the following:

mixture.viz(colors=cell_type)

You can also assess the clustering performance with a confusion matrix:

plot_confusion_matrix(clusters, cell_type)

4.4 Predicting on New Data

You can predict unseen data using the .predict_clusters() method.

new_clusters = mixture.predict_clusters(endog)

Setting the offsets to zero.
Doing a VE step.
Intialization ...
Fitting a PlnDiag model with full covariance.
Intializing parameters ...
Initialization finished.

Upper bound on the fitting time:   0%|          | 0/150 [00:00<?, ?it/s]Upper bound on the fitting time:  68%|██████▊   | 102/150 [00:00<00:00, 1012.41it/s]Upper bound on the fitting time: 100%|██████████| 150/150 [00:00<00:00, 1029.15it/s]

Maximum number of iterations (150)  reached in 0.1 seconds.
Last  criterion = 0.01074308 . Required tolerance = 1e-06
Finished!

Upper bound on the fitting time:   0%|          | 0/30 [00:00<?, ?it/s]Upper bound on the fitting time: 100%|██████████| 30/30 [00:00<00:00, 616.05it/s]

Done!

The number of clusters has been arbitrary set to \(3\). A more data-driven approach is given by the PlnMixtureCollection in the next section.

4.5 Use `PlnMixtureCollection` for selecting the optimal number of clusters

To explore multiple cluster solutions, use PlnMixtureCollection:

Setting the offsets to zero.
Adjusting 6 PlnMixture models.

Fitting a PlnMixture model with diagonal covariances and 2 clusters.
Intialization ...
Fitting a PlnDiag model with full covariance.
Intializing parameters ...
Initialization finished.

Upper bound on the fitting time:   0%|          | 0/150 [00:00<?, ?it/s]Upper bound on the fitting time:  69%|██████▊   | 103/150 [00:00<00:00, 1029.30it/s]Upper bound on the fitting time: 100%|██████████| 150/150 [00:00<00:00, 1031.99it/s]

Maximum number of iterations (150)  reached in 0.1 seconds.
Last  criterion = 0.01074308 . Required tolerance = 1e-06
Finished!

Upper bound on the fitting time:   0%|          | 0/400 [00:00<?, ?it/s]Upper bound on the fitting time:  15%|█▍        | 59/400 [00:00<00:00, 585.59it/s]Upper bound on the fitting time:  30%|██▉       | 118/400 [00:00<00:00, 585.30it/s]Upper bound on the fitting time:  44%|████▍     | 177/400 [00:00<00:00, 585.99it/s]Upper bound on the fitting time:  59%|█████▉    | 236/400 [00:00<00:00, 584.68it/s]Upper bound on the fitting time:  74%|███████▍  | 295/400 [00:00<00:00, 583.16it/s]Upper bound on the fitting time:  88%|████████▊ | 354/400 [00:00<00:00, 584.29it/s]Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 583.97it/s]

Maximum number of iterations (400)  reached in 0.8 seconds.
Last  criterion = 5.36e-06 . Required tolerance = 1e-06
Fitting a PlnMixture model with diagonal covariances and 3 clusters.
Intialization ...
Fitting a PlnDiag model with full covariance.
Intializing parameters ...
Initialization finished.

Upper bound on the fitting time:   0%|          | 0/150 [00:00<?, ?it/s]Upper bound on the fitting time:  71%|███████   | 106/150 [00:00<00:00, 1051.31it/s]Upper bound on the fitting time: 100%|██████████| 150/150 [00:00<00:00, 1053.49it/s]

Maximum number of iterations (150)  reached in 0.1 seconds.
Last  criterion = 0.01074308 . Required tolerance = 1e-06
Finished!

Upper bound on the fitting time:   0%|          | 0/400 [00:00<?, ?it/s]Upper bound on the fitting time:  13%|█▎        | 52/400 [00:00<00:00, 515.42it/s]Upper bound on the fitting time:  26%|██▌       | 104/400 [00:00<00:00, 503.27it/s]Upper bound on the fitting time:  39%|███▉      | 155/400 [00:00<00:00, 501.31it/s]Upper bound on the fitting time:  52%|█████▏    | 206/400 [00:00<00:00, 499.96it/s]Upper bound on the fitting time:  64%|██████▍   | 257/400 [00:00<00:00, 499.32it/s]Upper bound on the fitting time:  77%|███████▋  | 307/400 [00:00<00:00, 499.52it/s]Upper bound on the fitting time:  89%|████████▉ | 357/400 [00:00<00:00, 499.14it/s]Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 499.41it/s]

Maximum number of iterations (400)  reached in 1.0 seconds.
Last  criterion = 5.37e-06 . Required tolerance = 1e-06
Fitting a PlnMixture model with diagonal covariances and 4 clusters.
Intialization ...
Fitting a PlnDiag model with full covariance.
Intializing parameters ...
Initialization finished.

Upper bound on the fitting time:   0%|          | 0/150 [00:00<?, ?it/s]Upper bound on the fitting time:  69%|██████▉   | 104/150 [00:00<00:00, 1034.52it/s]Upper bound on the fitting time: 100%|██████████| 150/150 [00:00<00:00, 1039.50it/s]

Maximum number of iterations (150)  reached in 0.1 seconds.
Last  criterion = 0.01074308 . Required tolerance = 1e-06
Finished!

Upper bound on the fitting time:   0%|          | 0/400 [00:00<?, ?it/s]Upper bound on the fitting time:  12%|█▏        | 47/400 [00:00<00:00, 460.26it/s]Upper bound on the fitting time:  24%|██▎       | 94/400 [00:00<00:00, 434.89it/s]Upper bound on the fitting time:  35%|███▍      | 139/400 [00:00<00:00, 438.71it/s]Upper bound on the fitting time:  46%|████▌     | 184/400 [00:00<00:00, 441.01it/s]Upper bound on the fitting time:  57%|█████▊    | 230/400 [00:00<00:00, 444.89it/s]Upper bound on the fitting time:  69%|██████▉   | 275/400 [00:00<00:00, 443.45it/s]Upper bound on the fitting time:  80%|████████  | 320/400 [00:00<00:00, 441.78it/s]Upper bound on the fitting time:  91%|█████████▏| 365/400 [00:00<00:00, 441.92it/s]Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 441.47it/s]

Maximum number of iterations (400)  reached in 1.1 seconds.
Last  criterion = 5.37e-06 . Required tolerance = 1e-06
Fitting a PlnMixture model with diagonal covariances and 5 clusters.
Intialization ...
Fitting a PlnDiag model with full covariance.
Intializing parameters ...
Initialization finished.

Upper bound on the fitting time:   0%|          | 0/150 [00:00<?, ?it/s]Upper bound on the fitting time:  71%|███████   | 106/150 [00:00<00:00, 1052.01it/s]Upper bound on the fitting time: 100%|██████████| 150/150 [00:00<00:00, 1049.85it/s]

Maximum number of iterations (150)  reached in 0.1 seconds.
Last  criterion = 0.01074308 . Required tolerance = 1e-06
Finished!

Upper bound on the fitting time:   0%|          | 0/400 [00:00<?, ?it/s]Upper bound on the fitting time:  12%|█▏        | 49/400 [00:00<00:00, 481.96it/s]Upper bound on the fitting time:  24%|██▍       | 98/400 [00:00<00:00, 482.21it/s]Upper bound on the fitting time:  37%|███▋      | 147/400 [00:00<00:00, 479.80it/s]Upper bound on the fitting time:  49%|████▉     | 195/400 [00:00<00:00, 479.87it/s]Upper bound on the fitting time:  61%|██████    | 244/400 [00:00<00:00, 482.78it/s]Upper bound on the fitting time:  73%|███████▎  | 293/400 [00:00<00:00, 485.18it/s]Upper bound on the fitting time:  86%|████████▌ | 342/400 [00:00<00:00, 485.82it/s]Upper bound on the fitting time:  98%|█████████▊| 391/400 [00:00<00:00, 486.14it/s]Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 483.63it/s]

Maximum number of iterations (400)  reached in 1.0 seconds.
Last  criterion = 5.37e-06 . Required tolerance = 1e-06
Fitting a PlnMixture model with diagonal covariances and 6 clusters.
Intialization ...
Fitting a PlnDiag model with full covariance.
Intializing parameters ...
Initialization finished.

Upper bound on the fitting time:   0%|          | 0/150 [00:00<?, ?it/s]Upper bound on the fitting time:  70%|███████   | 105/150 [00:00<00:00, 1044.85it/s]Upper bound on the fitting time: 100%|██████████| 150/150 [00:00<00:00, 1049.52it/s]

Maximum number of iterations (150)  reached in 0.1 seconds.
Last  criterion = 0.01074308 . Required tolerance = 1e-06
Finished!

Upper bound on the fitting time:   0%|          | 0/400 [00:00<?, ?it/s]Upper bound on the fitting time:  12%|█▏        | 46/400 [00:00<00:00, 454.97it/s]Upper bound on the fitting time:  23%|██▎       | 92/400 [00:00<00:00, 442.31it/s]Upper bound on the fitting time:  34%|███▍      | 137/400 [00:00<00:00, 438.68it/s]Upper bound on the fitting time:  46%|████▌     | 182/400 [00:00<00:00, 439.69it/s]Upper bound on the fitting time:  57%|█████▋    | 227/400 [00:00<00:00, 442.35it/s]Upper bound on the fitting time:  68%|██████▊   | 272/400 [00:00<00:00, 443.23it/s]Upper bound on the fitting time:  79%|███████▉  | 317/400 [00:00<00:00, 443.82it/s]Upper bound on the fitting time:  90%|█████████ | 362/400 [00:00<00:00, 443.97it/s]Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 442.71it/s]

Maximum number of iterations (400)  reached in 1.1 seconds.
Last  criterion = 5.37e-06 . Required tolerance = 1e-06
Fitting a PlnMixture model with diagonal covariances and 7 clusters.
Intialization ...
Fitting a PlnDiag model with full covariance.
Intializing parameters ...
Initialization finished.

Upper bound on the fitting time:   0%|          | 0/150 [00:00<?, ?it/s]Upper bound on the fitting time:  69%|██████▉   | 104/150 [00:00<00:00, 1035.77it/s]Upper bound on the fitting time: 100%|██████████| 150/150 [00:00<00:00, 1033.99it/s]

Maximum number of iterations (150)  reached in 0.1 seconds.
Last  criterion = 0.01074308 . Required tolerance = 1e-06
Finished!

Upper bound on the fitting time:   0%|          | 0/400 [00:00<?, ?it/s]Upper bound on the fitting time:   7%|▋         | 29/400 [00:00<00:01, 282.14it/s]Upper bound on the fitting time:  14%|█▍        | 58/400 [00:00<00:01, 272.19it/s]Upper bound on the fitting time:  22%|██▏       | 86/400 [00:00<00:01, 272.82it/s]Upper bound on the fitting time:  28%|██▊       | 114/400 [00:00<00:01, 273.85it/s]Upper bound on the fitting time:  36%|███▌      | 142/400 [00:00<00:00, 274.10it/s]Upper bound on the fitting time:  42%|████▎     | 170/400 [00:00<00:00, 274.32it/s]Upper bound on the fitting time:  50%|████▉     | 198/400 [00:00<00:00, 272.36it/s]Upper bound on the fitting time:  56%|█████▋    | 226/400 [00:00<00:00, 267.69it/s]Upper bound on the fitting time:  63%|██████▎   | 253/400 [00:00<00:00, 265.18it/s]Upper bound on the fitting time:  70%|███████   | 280/400 [00:01<00:00, 264.95it/s]Upper bound on the fitting time:  77%|███████▋  | 307/400 [00:01<00:00, 265.61it/s]Upper bound on the fitting time:  84%|████████▎ | 334/400 [00:01<00:00, 266.87it/s]Upper bound on the fitting time:  90%|█████████ | 362/400 [00:01<00:00, 268.37it/s]Upper bound on the fitting time:  98%|█████████▊| 390/400 [00:01<00:00, 269.65it/s]Upper bound on the fitting time: 100%|██████████| 400/400 [00:01<00:00, 269.62it/s]

Maximum number of iterations (400)  reached in 1.6 seconds.
Last  criterion = 5.37e-06 . Required tolerance = 1e-06
======================================================================

DONE!
Best model (lower BIC): n_cluster 7
    Best model(lower AIC): n_cluster  7

======================================================================

4.5.1 Evaluate model quality

collection.show(figsize=(8, 8))

This displays BIC, AIC, log-likelihood, and clustering metric (WCSS, silhouette) for each model.

Use the silhouette score (higher is better) or WCSS (lower is better) to determine the optimal number of clusters. In this example, 2 clusters may be appropriate.

⚠️ WCSS always decreases with the number of clusters, so that an ELBOW method must be used. Based on the graph and the WCSS metric, I would suggest 4 clusters.

The silhouette score is not sensitive to the number of clusters.

4.5.2 Select the best model

One can directly access the best model using the best_model method, with the metric of choice:

best_mixture = collection.best_model("silhouette") # Could be also BIC, AIC, or WCSS.
print(best_mixture)

A multivariate PlnMixture with diagonal covariances and 2 clusters.
======================================================================
     Loglike   Dimension    Nb param         BIC         AIC         ICL  Silhouette
   -25178.45          20          80  25418.1123  25258.4537    12101.03   0.5515394

======================================================================
* Useful attributes
    .latent_variables .latent_positions .coef .covariance .precision .model_parameters .latent_parameters .optim_details
* Useful methods
    .transform() .show() .predict() .sigma() .projected_latent_variables() .plot_correlation_circle() .biplot() .viz() .pca_pairplot() .plot_expected_vs_true()
* Additional attributes for PlnMixture are:
    .covariances .cluster_bias .latent_prob .n_cluster .weights
* Additional methods for PlnMixture are:
    .predict_clusters()

You can access specific models within the collection by using the cluster number as a key:

print(collection[3])

A multivariate PlnMixture with diagonal covariances and 3 clusters.
======================================================================
     Loglike   Dimension    Nb param         BIC         AIC         ICL  Silhouette
   -24320.76          20         120  24680.2439   24440.756    11700.15   0.4245733

======================================================================
* Useful attributes
    .latent_variables .latent_positions .coef .covariance .precision .model_parameters .latent_parameters .optim_details
* Useful methods
    .transform() .show() .predict() .sigma() .projected_latent_variables() .plot_correlation_circle() .biplot() .viz() .pca_pairplot() .plot_expected_vs_true()
* Additional attributes for PlnMixture are:
    .covariances .cluster_bias .latent_prob .n_cluster .weights
* Additional methods for PlnMixture are:
    .predict_clusters()

and loop through the collection:

for mixture in collection.values():
    print(mixture)

References

Batardière, Bastien, Joon Kwon, and Julien Chiquet. 2024. “pyPLNmodels: A Python Package to Analyze Multivariate High-Dimensional Count Data.” Journal of Open Source Software.