from pyPLNmodels import load_scrna
rna = load_scrna(dim=20)
print("Data keys:", rna.keys())Returning scRNA dataset of size (400, 20)
Data keys: dict_keys(['endog', 'labels', 'labels_1hot'])
The Poisson lognormal (PLN) model can be extended to uncover clustering structures in count data. Two main approaches are available:
PlnLDA, documentation).PlnMixture, documentation).We illustrate both using a single-cell RNA-seq dataset provided by the load_scrna function of the package. Each row represents a cell and each column a gene. Cell types are included as labels and used in the supervised setting.
from pyPLNmodels import load_scrna
rna = load_scrna(dim=20)
print("Data keys:", rna.keys())Returning scRNA dataset of size (400, 20)
Data keys: dict_keys(['endog', 'labels', 'labels_1hot'])
endog)endog = rna["endog"]
print(endog.head()) FTL MALAT1 FTH1 TMSB4X CD74 SPP1 APOE B2M ACTB HLA-DRA \
0 5.0 39.0 14.0 39.0 11.0 0.0 0.0 88.0 49.0 0.0
1 8.0 52.0 38.0 186.0 17.0 0.0 0.0 152.0 167.0 3.0
2 76.0 29.0 38.0 71.0 37.0 3.0 2.0 21.0 7.0 12.0
3 354.0 151.0 236.0 184.0 287.0 0.0 238.0 255.0 27.0 243.0
4 0.0 55.0 10.0 37.0 8.0 1.0 0.0 23.0 49.0 1.0
APOC1 MT-CO1 TMSB10 EEF1A1 HLA-DRB1 HLA-DPA1 HLA-DPB1 MT-CO2 LYZ \
0 0.0 29.0 22.0 34.0 0.0 0.0 1.0 21.0 0.0
1 0.0 109.0 65.0 128.0 2.0 4.0 1.0 172.0 1.0
2 3.0 46.0 7.0 11.0 0.0 7.0 17.0 16.0 0.0
3 196.0 118.0 45.0 98.0 66.0 115.0 160.0 27.0 6.0
4 0.0 11.0 25.0 10.0 1.0 2.0 2.0 6.0 0.0
RPLP1
0 14.0
1 124.0
2 20.0
3 40.0
4 18.0
labels)cell_type = rna["labels"]
print("Cell types:", cell_type.unique())
print(cell_type.head())
print("Sample counts:", cell_type.value_counts())Cell types: ['T_cells_CD4+' 'Macrophages' 'T_cells_CD8+']
0 T_cells_CD4+
1 T_cells_CD4+
2 Macrophages
3 Macrophages
4 T_cells_CD4+
Name: standard_true_celltype_v5, dtype: object
Sample counts: standard_true_celltype_v5
T_cells_CD8+ 162
T_cells_CD4+ 139
Macrophages 99
Name: count, dtype: int64
PlnLDAThis method performs a Poisson lognormal discriminant analysis (PLN-LDA), similar to classical LDA but adapted to count data via a latent Gaussian layer.
Let \(c_i\) denote the known cluster label for sample \(i\):
\[ \begin{align} Z_i| c_i = k &\sim \mathcal{N}(X_i^\top B + \mu_{k}, \Sigma), \\ Y_{ij} &\sim \mathcal{P}(\exp(o_{ij} + Z_{ij})) \end{align} \]
We separate the data into training and test sets. The training set is used to fit the model, while the test set is used for evaluation.
train_prop = 0.8
n_train = int(endog.shape[0] * train_prop)
endog_train, endog_test = endog[:n_train], endog[n_train:]
cell_type_train, cell_type_test = cell_type[:n_train], cell_type[n_train:]from pyPLNmodels import PlnLDA
lda = PlnLDA(endog_train, clusters=cell_type_train).fit()
print(lda)Setting the offsets to zero.
Fitting a PlnLDA model with full covariance.
Intializing parameters ...
Initialization finished.
Upper bound on the fitting time: 0%| | 0/400 [00:00<?, ?it/s]Upper bound on the fitting time: 26%|██▌ | 102/400 [00:00<00:00, 1017.58it/s]Upper bound on the fitting time: 51%|█████ | 204/400 [00:00<00:00, 999.29it/s] Upper bound on the fitting time: 76%|███████▋ | 306/400 [00:00<00:00, 1006.35it/s]Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 1009.37it/s]
Maximum number of iterations (400) reached in 1.2 seconds.
Last criterion = 5.4e-06 . Required tolerance = 1e-06
A multivariate PlnLDA with full covariance.
======================================================================
Loglike Dimension Nb param BIC AIC ICL
-17914.56 20 210 18520.2362 18124.5625 8088.1
======================================================================
* Useful attributes
.latent_variables .latent_positions .coef .covariance .precision .model_parameters .latent_parameters .optim_details
* Useful methods
.transform() .show() .predict() .sigma() .projected_latent_variables() .plot_correlation_circle() .biplot() .viz() .pca_pairplot() .plot_expected_vs_true()
* Additional attributes for PlnLDA are:
.clusters .latent_positions_clusters
* Additional methods for PlnLDA are:
.predict_clusters() .transform_new() .viz_transformed()
You can also define the model using R-style formulas as follows:
data_train = {"endog_train": endog_train, "clusters_train": cell_type_train}
_ = PlnLDA.from_formula("endog_train ~ 0 | clusters_train", data=data_train)Setting the offsets to zero.
Use a pipe | to separate the exogenous variables from the clusters. See the dedicated tutorial for more details on how to specify a model.
pred_test = lda.predict_clusters(endog_test)Setting the offsets to zero.
Doing a VE step.
Upper bound on the fitting time: 0%| | 0/50 [00:00<?, ?it/s]Upper bound on the fitting time: 100%|██████████| 50/50 [00:00<00:00, 1410.83it/s]
Done!
Setting the offsets to zero.
Doing a VE step.
Upper bound on the fitting time: 0%| | 0/50 [00:00<?, ?it/s]Upper bound on the fitting time: 100%|██████████| 50/50 [00:00<00:00, 1442.06it/s]
Done!
Setting the offsets to zero.
Doing a VE step.
Upper bound on the fitting time: 0%| | 0/50 [00:00<?, ?it/s]Upper bound on the fitting time: 100%|██████████| 50/50 [00:00<00:00, 1452.76it/s]
Done!
from pyPLNmodels import plot_confusion_matrix
plot_confusion_matrix(pred_test, cell_type_test)You can enhance performance by incorporating additional variables or increasing the number of samples in your dataset.
The .transform_new() method transforms unseen endogenous data into the latent space via LDA.
lda.viz_transformed(lda.transform_new(endog_test), colors=cell_type_test)Setting the offsets to zero.
Doing a VE step.
Upper bound on the fitting time: 0%| | 0/50 [00:00<?, ?it/s]Upper bound on the fitting time: 100%|██████████| 50/50 [00:00<00:00, 1217.64it/s]
Done!
Setting the offsets to zero.
Doing a VE step.
Upper bound on the fitting time: 0%| | 0/50 [00:00<?, ?it/s]Upper bound on the fitting time: 100%|██████████| 50/50 [00:00<00:00, 1304.64it/s]
Done!
Setting the offsets to zero.
Doing a VE step.
Upper bound on the fitting time: 0%| | 0/50 [00:00<?, ?it/s]Upper bound on the fitting time: 100%|██████████| 50/50 [00:00<00:00, 1414.64it/s]
Done!
lda.viz_transformed(lda.transform_new(endog_train), colors=cell_type_train)Setting the offsets to zero.
Doing a VE step.
Upper bound on the fitting time: 0%| | 0/50 [00:00<?, ?it/s]Upper bound on the fitting time: 100%|██████████| 50/50 [00:00<00:00, 1143.74it/s]
Done!
Setting the offsets to zero.
Doing a VE step.
Upper bound on the fitting time: 0%| | 0/50 [00:00<?, ?it/s]Upper bound on the fitting time: 100%|██████████| 50/50 [00:00<00:00, 1146.37it/s]
Done!
Setting the offsets to zero.
Doing a VE step.
Upper bound on the fitting time: 0%| | 0/50 [00:00<?, ?it/s]Upper bound on the fitting time: 100%|██████████| 50/50 [00:00<00:00, 1147.84it/s]
Done!
You can include exogenous variables in the model; however, they will not be used for making predictions. When predicting on new data, ensure that the corresponding covariates for the samples are provided.
import torch
dumb_lda = PlnLDA(endog_train, exog=torch.randn(endog_train.shape[0], 2), clusters=cell_type_train).fit()
_ = dumb_lda.predict_clusters(endog_test, exog=torch.randn(endog_test.shape[0], 2))Setting the offsets to zero.
Fitting a PlnLDA model with full covariance.
Intializing parameters ...
Initialization finished.
Upper bound on the fitting time: 0%| | 0/400 [00:00<?, ?it/s]Upper bound on the fitting time: 25%|██▍ | 99/400 [00:00<00:00, 982.59it/s]Upper bound on the fitting time: 50%|█████ | 200/400 [00:00<00:00, 995.39it/s]Upper bound on the fitting time: 75%|███████▌ | 301/400 [00:00<00:00, 998.88it/s]Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 995.52it/s]
Maximum number of iterations (400) reached in 0.4 seconds.
Last criterion = 5.4e-06 . Required tolerance = 1e-06
Setting the offsets to zero.
Doing a VE step.
Upper bound on the fitting time: 0%| | 0/50 [00:00<?, ?it/s]Upper bound on the fitting time: 100%|██████████| 50/50 [00:00<00:00, 1253.41it/s]
Done!
Setting the offsets to zero.
Doing a VE step.
Upper bound on the fitting time: 0%| | 0/50 [00:00<?, ?it/s]Upper bound on the fitting time: 100%|██████████| 50/50 [00:00<00:00, 1243.13it/s]
Done!
Setting the offsets to zero.
Doing a VE step.
Upper bound on the fitting time: 0%| | 0/50 [00:00<?, ?it/s]Upper bound on the fitting time: 100%|██████████| 50/50 [00:00<00:00, 1254.84it/s]
Done!
⚠️ Exogenous variables must be full rank, and an intercept will be added to account for the cluster bias.. Avoid one-hot encodings without intercept removal, like this:
try:
wrong_lda = PlnLDA.from_formula("endog ~ 1 | labels", data = rna)
except ValueError as e:
print(str(e))
try:
wrong_lda = PlnLDA.from_formula("endog ~ labels | labels", data = rna)
except ValueError as e:
print(str(e))Setting the offsets to zero.
Input matrix (exog,clusters) does not result in (exog,clusters).T @(exog,clusters) being full rank (rank = 3, expected = 4). You may consider removing one or more variables or set add_const to False if that is not already the case. You can also set 0 + exog | clusters in the formula to avoid adding an intercept.
Setting the offsets to zero.
Input matrix (exog,clusters) does not result in (exog,clusters).T @(exog,clusters) being full rank (rank = 3, expected = 6). You may consider removing one or more variables or set add_const to False if that is not already the case. You can also set 0 + exog | clusters in the formula to avoid adding an intercept.
PlnMixtureWhen no labels are known, PlnMixture fits a latent mixture model to identify subpopulations in an unsupervised way.
\[ \begin{align} c_i &\sim \mathcal{M}(1, \pi), \\ Z_i \mid c_i=k &\sim \mathcal{N}(X_i^\top B + \mu_k, \Sigma_k), \\ Y_{ij} &\sim \mathcal{P}(\exp(o_{ij} + Z_{ij})) \end{align} \]
Covariance matrices \(\Sigma_k\) are assumed diagonal.
from pyPLNmodels import PlnMixture
mixture = PlnMixture(endog, n_cluster=3).fit()
print(mixture)Setting the offsets to zero.
Fitting a PlnMixture model with diagonal covariances and 3 clusters.
Intialization ...
Fitting a PlnDiag model with full covariance.
Intializing parameters ...
Initialization finished.
Upper bound on the fitting time: 0%| | 0/150 [00:00<?, ?it/s]Upper bound on the fitting time: 72%|███████▏ | 108/150 [00:00<00:00, 1072.36it/s]Upper bound on the fitting time: 100%|██████████| 150/150 [00:00<00:00, 1072.58it/s]
Maximum number of iterations (150) reached in 0.1 seconds.
Last criterion = 0.01074308 . Required tolerance = 1e-06
Finished!
Upper bound on the fitting time: 0%| | 0/400 [00:00<?, ?it/s]Upper bound on the fitting time: 12%|█▏ | 49/400 [00:00<00:00, 481.98it/s]Upper bound on the fitting time: 24%|██▍ | 98/400 [00:00<00:00, 471.17it/s]Upper bound on the fitting time: 36%|███▋ | 146/400 [00:00<00:00, 468.95it/s]Upper bound on the fitting time: 48%|████▊ | 194/400 [00:00<00:00, 469.85it/s]Upper bound on the fitting time: 60%|██████ | 241/400 [00:00<00:00, 284.87it/s]Upper bound on the fitting time: 73%|███████▎ | 293/400 [00:00<00:00, 339.64it/s]Upper bound on the fitting time: 86%|████████▌ | 342/400 [00:00<00:00, 377.22it/s]Upper bound on the fitting time: 98%|█████████▊| 390/400 [00:00<00:00, 402.95it/s]Upper bound on the fitting time: 100%|██████████| 400/400 [00:01<00:00, 390.06it/s]
Maximum number of iterations (400) reached in 1.2 seconds.
Last criterion = 5.37e-06 . Required tolerance = 1e-06
A multivariate PlnMixture with diagonal covariances and 3 clusters.
======================================================================
Loglike Dimension Nb param BIC AIC ICL Silhouette
-24320.87 20 120 24680.3536 24440.8657 11699.65 0.42457986
======================================================================
* Useful attributes
.latent_variables .latent_positions .coef .covariance .precision .model_parameters .latent_parameters .optim_details
* Useful methods
.transform() .show() .predict() .sigma() .projected_latent_variables() .plot_correlation_circle() .biplot() .viz() .pca_pairplot() .plot_expected_vs_true()
* Additional attributes for PlnMixture are:
.covariances .cluster_bias .latent_prob .n_cluster .weights
* Additional methods for PlnMixture are:
.predict_clusters()
_ = PlnMixture(endog, exog=torch.randn(endog.shape[0], 2), n_cluster=3)Setting the offsets to zero.
⚠️ Exogenous variables must be full rank, and an intercept will be added to account for the cluster bias. Avoid one-hot encodings without intercept removal:
try:
wrong_mixt = PlnMixture.from_formula("endog ~ 1", data = rna, n_cluster = 3)
except ValueError as e:
print(str(e))
try:
wrong_mixt = PlnMixture.from_formula("endog ~ labels", data = rna, n_cluster = 3)
except ValueError as e:
print(str(e))
try:
wrong_mixt = PlnMixture(endog = rna["endog"], exog = rna["labels_1hot"], n_cluster = 3)
except ValueError as e:
print(str(e))
right_mixt = PlnMixture.from_formula("endog ~ 0", data = rna, n_cluster = 3)
right_mixt = PlnMixture(endog = rna["endog"], exog = rna["labels_1hot"].iloc[:,1:], n_cluster = 3)Setting the offsets to zero.
Input matrix exog does not result in (exog,1).T @(exog,1) being full rank (rank = 1, expected = 2). This gives non identifiable cluster means. You may consider removing one or more variables in exog.
Setting the offsets to zero.
Input matrix exog does not result in (exog,1).T @(exog,1) being full rank (rank = 3, expected = 4). This gives non identifiable cluster means. You may consider removing one or more variables in exog.
Setting the offsets to zero.
Input matrix exog does not result in (exog,1).T @(exog,1) being full rank (rank = 3, expected = 4). This gives non identifiable cluster means. You may consider removing one or more variables in exog.
Setting the offsets to zero.
Setting the offsets to zero.
print(mixture.weights)tensor([0.5004, 0.2524, 0.2471], dtype=torch.float64)
One can access the clusters of the samples using the .clusters attribute:
clusters = mixture.clustersSome useful visualization and information about clusters are displayed using the .show() method:
mixture.show()One can visualize the latent variables using the .viz() method. By default, the latent variables are colored by the inferred clusters, but other colors can be specified using the colors argument.
mixture.viz()You can compare the inferred clusters with the actual cell types, although retrieving the exact cell types is not expected since the method is unsupervised.
To visualize the clustering results, use the following:
mixture.viz(colors=cell_type)You can also assess the clustering performance with a confusion matrix:
plot_confusion_matrix(clusters, cell_type)You can predict unseen data using the .predict_clusters() method.
new_clusters = mixture.predict_clusters(endog)Setting the offsets to zero.
Doing a VE step.
Intialization ...
Fitting a PlnDiag model with full covariance.
Intializing parameters ...
Initialization finished.
Upper bound on the fitting time: 0%| | 0/150 [00:00<?, ?it/s]Upper bound on the fitting time: 68%|██████▊ | 102/150 [00:00<00:00, 1012.41it/s]Upper bound on the fitting time: 100%|██████████| 150/150 [00:00<00:00, 1029.15it/s]
Maximum number of iterations (150) reached in 0.1 seconds.
Last criterion = 0.01074308 . Required tolerance = 1e-06
Finished!
Upper bound on the fitting time: 0%| | 0/30 [00:00<?, ?it/s]Upper bound on the fitting time: 100%|██████████| 30/30 [00:00<00:00, 616.05it/s]
Done!
The number of clusters has been arbitrary set to \(3\). A more data-driven approach is given by the PlnMixtureCollection in the next section.
PlnMixtureCollection for selecting the optimal number of clustersTo explore multiple cluster solutions, use PlnMixtureCollection:
Setting the offsets to zero.
Adjusting 6 PlnMixture models.
Fitting a PlnMixture model with diagonal covariances and 2 clusters.
Intialization ...
Fitting a PlnDiag model with full covariance.
Intializing parameters ...
Initialization finished.
Upper bound on the fitting time: 0%| | 0/150 [00:00<?, ?it/s]Upper bound on the fitting time: 69%|██████▊ | 103/150 [00:00<00:00, 1029.30it/s]Upper bound on the fitting time: 100%|██████████| 150/150 [00:00<00:00, 1031.99it/s]
Maximum number of iterations (150) reached in 0.1 seconds.
Last criterion = 0.01074308 . Required tolerance = 1e-06
Finished!
Upper bound on the fitting time: 0%| | 0/400 [00:00<?, ?it/s]Upper bound on the fitting time: 15%|█▍ | 59/400 [00:00<00:00, 585.59it/s]Upper bound on the fitting time: 30%|██▉ | 118/400 [00:00<00:00, 585.30it/s]Upper bound on the fitting time: 44%|████▍ | 177/400 [00:00<00:00, 585.99it/s]Upper bound on the fitting time: 59%|█████▉ | 236/400 [00:00<00:00, 584.68it/s]Upper bound on the fitting time: 74%|███████▍ | 295/400 [00:00<00:00, 583.16it/s]Upper bound on the fitting time: 88%|████████▊ | 354/400 [00:00<00:00, 584.29it/s]Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 583.97it/s]
Maximum number of iterations (400) reached in 0.8 seconds.
Last criterion = 5.36e-06 . Required tolerance = 1e-06
Fitting a PlnMixture model with diagonal covariances and 3 clusters.
Intialization ...
Fitting a PlnDiag model with full covariance.
Intializing parameters ...
Initialization finished.
Upper bound on the fitting time: 0%| | 0/150 [00:00<?, ?it/s]Upper bound on the fitting time: 71%|███████ | 106/150 [00:00<00:00, 1051.31it/s]Upper bound on the fitting time: 100%|██████████| 150/150 [00:00<00:00, 1053.49it/s]
Maximum number of iterations (150) reached in 0.1 seconds.
Last criterion = 0.01074308 . Required tolerance = 1e-06
Finished!
Upper bound on the fitting time: 0%| | 0/400 [00:00<?, ?it/s]Upper bound on the fitting time: 13%|█▎ | 52/400 [00:00<00:00, 515.42it/s]Upper bound on the fitting time: 26%|██▌ | 104/400 [00:00<00:00, 503.27it/s]Upper bound on the fitting time: 39%|███▉ | 155/400 [00:00<00:00, 501.31it/s]Upper bound on the fitting time: 52%|█████▏ | 206/400 [00:00<00:00, 499.96it/s]Upper bound on the fitting time: 64%|██████▍ | 257/400 [00:00<00:00, 499.32it/s]Upper bound on the fitting time: 77%|███████▋ | 307/400 [00:00<00:00, 499.52it/s]Upper bound on the fitting time: 89%|████████▉ | 357/400 [00:00<00:00, 499.14it/s]Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 499.41it/s]
Maximum number of iterations (400) reached in 1.0 seconds.
Last criterion = 5.37e-06 . Required tolerance = 1e-06
Fitting a PlnMixture model with diagonal covariances and 4 clusters.
Intialization ...
Fitting a PlnDiag model with full covariance.
Intializing parameters ...
Initialization finished.
Upper bound on the fitting time: 0%| | 0/150 [00:00<?, ?it/s]Upper bound on the fitting time: 69%|██████▉ | 104/150 [00:00<00:00, 1034.52it/s]Upper bound on the fitting time: 100%|██████████| 150/150 [00:00<00:00, 1039.50it/s]
Maximum number of iterations (150) reached in 0.1 seconds.
Last criterion = 0.01074308 . Required tolerance = 1e-06
Finished!
Upper bound on the fitting time: 0%| | 0/400 [00:00<?, ?it/s]Upper bound on the fitting time: 12%|█▏ | 47/400 [00:00<00:00, 460.26it/s]Upper bound on the fitting time: 24%|██▎ | 94/400 [00:00<00:00, 434.89it/s]Upper bound on the fitting time: 35%|███▍ | 139/400 [00:00<00:00, 438.71it/s]Upper bound on the fitting time: 46%|████▌ | 184/400 [00:00<00:00, 441.01it/s]Upper bound on the fitting time: 57%|█████▊ | 230/400 [00:00<00:00, 444.89it/s]Upper bound on the fitting time: 69%|██████▉ | 275/400 [00:00<00:00, 443.45it/s]Upper bound on the fitting time: 80%|████████ | 320/400 [00:00<00:00, 441.78it/s]Upper bound on the fitting time: 91%|█████████▏| 365/400 [00:00<00:00, 441.92it/s]Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 441.47it/s]
Maximum number of iterations (400) reached in 1.1 seconds.
Last criterion = 5.37e-06 . Required tolerance = 1e-06
Fitting a PlnMixture model with diagonal covariances and 5 clusters.
Intialization ...
Fitting a PlnDiag model with full covariance.
Intializing parameters ...
Initialization finished.
Upper bound on the fitting time: 0%| | 0/150 [00:00<?, ?it/s]Upper bound on the fitting time: 71%|███████ | 106/150 [00:00<00:00, 1052.01it/s]Upper bound on the fitting time: 100%|██████████| 150/150 [00:00<00:00, 1049.85it/s]
Maximum number of iterations (150) reached in 0.1 seconds.
Last criterion = 0.01074308 . Required tolerance = 1e-06
Finished!
Upper bound on the fitting time: 0%| | 0/400 [00:00<?, ?it/s]Upper bound on the fitting time: 12%|█▏ | 49/400 [00:00<00:00, 481.96it/s]Upper bound on the fitting time: 24%|██▍ | 98/400 [00:00<00:00, 482.21it/s]Upper bound on the fitting time: 37%|███▋ | 147/400 [00:00<00:00, 479.80it/s]Upper bound on the fitting time: 49%|████▉ | 195/400 [00:00<00:00, 479.87it/s]Upper bound on the fitting time: 61%|██████ | 244/400 [00:00<00:00, 482.78it/s]Upper bound on the fitting time: 73%|███████▎ | 293/400 [00:00<00:00, 485.18it/s]Upper bound on the fitting time: 86%|████████▌ | 342/400 [00:00<00:00, 485.82it/s]Upper bound on the fitting time: 98%|█████████▊| 391/400 [00:00<00:00, 486.14it/s]Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 483.63it/s]
Maximum number of iterations (400) reached in 1.0 seconds.
Last criterion = 5.37e-06 . Required tolerance = 1e-06
Fitting a PlnMixture model with diagonal covariances and 6 clusters.
Intialization ...
Fitting a PlnDiag model with full covariance.
Intializing parameters ...
Initialization finished.
Upper bound on the fitting time: 0%| | 0/150 [00:00<?, ?it/s]Upper bound on the fitting time: 70%|███████ | 105/150 [00:00<00:00, 1044.85it/s]Upper bound on the fitting time: 100%|██████████| 150/150 [00:00<00:00, 1049.52it/s]
Maximum number of iterations (150) reached in 0.1 seconds.
Last criterion = 0.01074308 . Required tolerance = 1e-06
Finished!
Upper bound on the fitting time: 0%| | 0/400 [00:00<?, ?it/s]Upper bound on the fitting time: 12%|█▏ | 46/400 [00:00<00:00, 454.97it/s]Upper bound on the fitting time: 23%|██▎ | 92/400 [00:00<00:00, 442.31it/s]Upper bound on the fitting time: 34%|███▍ | 137/400 [00:00<00:00, 438.68it/s]Upper bound on the fitting time: 46%|████▌ | 182/400 [00:00<00:00, 439.69it/s]Upper bound on the fitting time: 57%|█████▋ | 227/400 [00:00<00:00, 442.35it/s]Upper bound on the fitting time: 68%|██████▊ | 272/400 [00:00<00:00, 443.23it/s]Upper bound on the fitting time: 79%|███████▉ | 317/400 [00:00<00:00, 443.82it/s]Upper bound on the fitting time: 90%|█████████ | 362/400 [00:00<00:00, 443.97it/s]Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 442.71it/s]
Maximum number of iterations (400) reached in 1.1 seconds.
Last criterion = 5.37e-06 . Required tolerance = 1e-06
Fitting a PlnMixture model with diagonal covariances and 7 clusters.
Intialization ...
Fitting a PlnDiag model with full covariance.
Intializing parameters ...
Initialization finished.
Upper bound on the fitting time: 0%| | 0/150 [00:00<?, ?it/s]Upper bound on the fitting time: 69%|██████▉ | 104/150 [00:00<00:00, 1035.77it/s]Upper bound on the fitting time: 100%|██████████| 150/150 [00:00<00:00, 1033.99it/s]
Maximum number of iterations (150) reached in 0.1 seconds.
Last criterion = 0.01074308 . Required tolerance = 1e-06
Finished!
Upper bound on the fitting time: 0%| | 0/400 [00:00<?, ?it/s]Upper bound on the fitting time: 7%|▋ | 29/400 [00:00<00:01, 282.14it/s]Upper bound on the fitting time: 14%|█▍ | 58/400 [00:00<00:01, 272.19it/s]Upper bound on the fitting time: 22%|██▏ | 86/400 [00:00<00:01, 272.82it/s]Upper bound on the fitting time: 28%|██▊ | 114/400 [00:00<00:01, 273.85it/s]Upper bound on the fitting time: 36%|███▌ | 142/400 [00:00<00:00, 274.10it/s]Upper bound on the fitting time: 42%|████▎ | 170/400 [00:00<00:00, 274.32it/s]Upper bound on the fitting time: 50%|████▉ | 198/400 [00:00<00:00, 272.36it/s]Upper bound on the fitting time: 56%|█████▋ | 226/400 [00:00<00:00, 267.69it/s]Upper bound on the fitting time: 63%|██████▎ | 253/400 [00:00<00:00, 265.18it/s]Upper bound on the fitting time: 70%|███████ | 280/400 [00:01<00:00, 264.95it/s]Upper bound on the fitting time: 77%|███████▋ | 307/400 [00:01<00:00, 265.61it/s]Upper bound on the fitting time: 84%|████████▎ | 334/400 [00:01<00:00, 266.87it/s]Upper bound on the fitting time: 90%|█████████ | 362/400 [00:01<00:00, 268.37it/s]Upper bound on the fitting time: 98%|█████████▊| 390/400 [00:01<00:00, 269.65it/s]Upper bound on the fitting time: 100%|██████████| 400/400 [00:01<00:00, 269.62it/s]
Maximum number of iterations (400) reached in 1.6 seconds.
Last criterion = 5.37e-06 . Required tolerance = 1e-06
======================================================================
DONE!
Best model (lower BIC): n_cluster 7
Best model(lower AIC): n_cluster 7
======================================================================
collection.show(figsize=(8, 8))This displays BIC, AIC, log-likelihood, and clustering metric (WCSS, silhouette) for each model.
Use the silhouette score (higher is better) or WCSS (lower is better) to determine the optimal number of clusters. In this example, 2 clusters may be appropriate.
⚠️ WCSS always decreases with the number of clusters, so that an ELBOW method must be used. Based on the graph and the WCSS metric, I would suggest 4 clusters.
The silhouette score is not sensitive to the number of clusters.
One can directly access the best model using the best_model method, with the metric of choice:
best_mixture = collection.best_model("silhouette") # Could be also BIC, AIC, or WCSS.
print(best_mixture)A multivariate PlnMixture with diagonal covariances and 2 clusters.
======================================================================
Loglike Dimension Nb param BIC AIC ICL Silhouette
-25178.45 20 80 25418.1123 25258.4537 12101.03 0.5515394
======================================================================
* Useful attributes
.latent_variables .latent_positions .coef .covariance .precision .model_parameters .latent_parameters .optim_details
* Useful methods
.transform() .show() .predict() .sigma() .projected_latent_variables() .plot_correlation_circle() .biplot() .viz() .pca_pairplot() .plot_expected_vs_true()
* Additional attributes for PlnMixture are:
.covariances .cluster_bias .latent_prob .n_cluster .weights
* Additional methods for PlnMixture are:
.predict_clusters()
You can access specific models within the collection by using the cluster number as a key:
print(collection[3])A multivariate PlnMixture with diagonal covariances and 3 clusters.
======================================================================
Loglike Dimension Nb param BIC AIC ICL Silhouette
-24320.76 20 120 24680.2439 24440.756 11700.15 0.4245733
======================================================================
* Useful attributes
.latent_variables .latent_positions .coef .covariance .precision .model_parameters .latent_parameters .optim_details
* Useful methods
.transform() .show() .predict() .sigma() .projected_latent_variables() .plot_correlation_circle() .biplot() .viz() .pca_pairplot() .plot_expected_vs_true()
* Additional attributes for PlnMixture are:
.covariances .cluster_bias .latent_prob .n_cluster .weights
* Additional methods for PlnMixture are:
.predict_clusters()
and loop through the collection:
for mixture in collection.values():
print(mixture)