from pyPLNmodels import load_scrna
= load_scrna(dim=10)
rna print('Data: ', rna.keys())
Returning scRNA dataset of size (400, 10)
Data: dict_keys(['endog', 'labels', 'labels_1hot'])
This guide introduces the basics of multivariate count data analysis using the pyPLNmodels
package. For complex datasets, traditional models like Poisson regression and Zero-Inflated Poisson Regression (ZIP) may fail to capture important insights, such as correlations between counts (i.e. variables). The pyPLNmodels
package provides two key models to address these challenges:
Pln
model (documentation)PlnPCA
model (documentation)The PlnPCA
model extends the functionality of the Pln
model to handle high-dimensional data, though it may slightly compromise parameter estimation accuracy.
Both models are built on the following assumptions for a given count matrix \(Y\) (Aitchison and Ho 1989):
\[ Y_{ij}| Z_{ij} \sim \mathcal P(\exp(o_{ij} + Z_{ij})), \quad Z_{i}\sim \mathcal N(X_i^{\top} B, \Sigma),\] where the input data includes:
endog
): the \(j\)-th count for the \(i\)-th observationexog
): covariates for the \(i\)-th observation (if available)offsets
): offset for the \(i\)-th observation (if available)and the model parameters are:
coef
): a matrix of regression coefficientscovariance
): the covariance matrix of the latent variables \(Z_i\)These models aim to capture the structure of the data through the latent variables \(Z\).
The Pln
model assumes \(\Sigma\) has full rank, while the PlnPCA
model assumes \(\Sigma\) has a low rank, which must be specified by the user. A lower rank introduces a trade-off, reducing computational complexity but potentially compromising parameter estimation accuracy.
The pyPLNmodels
package is designed to:
This is achieved using the input count matrix \(Y\), along with optional covariate matrix \(X\) (defaulting to a vector of 1s) and offsets \(O\) (defaulting to a matrix of 0s).
In this example, we analyze single-cell RNA-seq data provided by the load_scrna
function in the package. Each column in the dataset represents a gene, while each row corresponds to a cell (i.e., an individual). Covariates for cell types (labels
) are also included. For simplicity, we limit the analysis to \(10\) variables (dimensions).
from pyPLNmodels import load_scrna
= load_scrna(dim=10)
rna print('Data: ', rna.keys())
Returning scRNA dataset of size (400, 10)
Data: dict_keys(['endog', 'labels', 'labels_1hot'])
endog
)= rna["endog"]
endog print(endog.head())
FTL MALAT1 FTH1 TMSB4X CD74 SPP1 APOE B2M ACTB HLA-DRA
0 5.0 39.0 14.0 39.0 11.0 0.0 0.0 88.0 49.0 0.0
1 8.0 52.0 38.0 186.0 17.0 0.0 0.0 152.0 167.0 3.0
2 76.0 29.0 38.0 71.0 37.0 3.0 2.0 21.0 7.0 12.0
3 354.0 151.0 236.0 184.0 287.0 0.0 238.0 255.0 27.0 243.0
4 0.0 55.0 10.0 37.0 8.0 1.0 0.0 23.0 49.0 1.0
= rna["labels"]
cell_type print('Possible cell types: ', cell_type.unique())
print(cell_type.head())
Possible cell types: ['T_cells_CD4+' 'Macrophages' 'T_cells_CD8+']
0 T_cells_CD4+
1 T_cells_CD4+
2 Macrophages
3 Macrophages
4 T_cells_CD4+
Name: standard_true_celltype_v5, dtype: object
To analyze the mean values for each cell type, we use cell type as a covariate, with Macrophages
as the reference category:
from pyPLNmodels import Pln
= Pln.from_formula('endog ~ 1 + labels', data=rna).fit() pln
Setting the offsets to zero.
Fitting a Pln model with full covariance.
Intializing parameters ...
Initialization finished.
Upper bound on the fitting time: 0%| | 0/400 [00:00<?, ?it/s]Upper bound on the fitting time: 21%|██ | 83/400 [00:00<00:00, 826.25it/s]Upper bound on the fitting time: 42%|████▏ | 168/400 [00:00<00:00, 837.12it/s]Upper bound on the fitting time: 65%|██████▍ | 259/400 [00:00<00:00, 869.92it/s]Upper bound on the fitting time: 90%|█████████ | 360/400 [00:00<00:00, 922.16it/s]Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 903.95it/s]
Maximum number of iterations (400) reached in 1.3 seconds.
Last criterion = 5.4e-06 . Required tolerance = 1e-06
For more details on formula syntax and model initialization, including handling offsets, refer to the dedicated tutorial.
After fitting the model, you can print its configuration and key details:
print(pln)
A multivariate Pln with full covariance.
======================================================================
Loglike Dimension Nb param BIC AIC ICL
-12790.12 10 85 13044.7622 12875.125 6261.7
======================================================================
* Useful attributes
.latent_variables .latent_positions .coef .covariance .precision .model_parameters .latent_parameters .optim_details
* Useful methods
.transform() .show() .predict() .sigma() .projected_latent_variables() .plot_correlation_circle() .biplot() .viz() .pca_pairplot() .plot_expected_vs_true()
* Additional attributes for Pln are:
None
* Additional methods for Pln are:
.get_variance_coef() .get_confidence_interval_coef() .summary() .get_coef_p_values() .plot_regression_forest()
To gain deeper insights into the model parameters and the optimization process, use the .show()
method:
pln.show()
Monitoring the norm of each parameter is essential to assess model convergence. If the model has not converged, consider refitting it with additional iterations and a reduced tolerance (tol
). To adjust the number of iterations, use the maxiter
parameter:
=1000, tol = 0).show() pln.fit(maxiter
Fitting a Pln model with full covariance.
Upper bound on the fitting time: 0%| | 0/1000 [00:00<?, ?it/s]Upper bound on the fitting time: 11%|█ | 107/1000 [00:00<00:00, 1061.61it/s]Upper bound on the fitting time: 21%|██▏ | 214/1000 [00:00<00:00, 1056.11it/s]Upper bound on the fitting time: 32%|███▏ | 323/1000 [00:00<00:00, 1069.60it/s]Upper bound on the fitting time: 43%|████▎ | 431/1000 [00:00<00:00, 1073.67it/s]Upper bound on the fitting time: 54%|█████▍ | 539/1000 [00:00<00:00, 1075.09it/s]Upper bound on the fitting time: 65%|██████▍ | 647/1000 [00:00<00:00, 1055.11it/s]Upper bound on the fitting time: 75%|███████▌ | 753/1000 [00:00<00:00, 1041.14it/s]Upper bound on the fitting time: 86%|████████▌ | 858/1000 [00:00<00:00, 1022.65it/s]Upper bound on the fitting time: 96%|█████████▌| 961/1000 [00:00<00:00, 1024.86it/s]Upper bound on the fitting time: 100%|██████████| 1000/1000 [00:00<00:00, 1044.85it/s]
Maximum number of iterations (1400) reached in 2.2 seconds.
Last criterion = 0.0 . Required tolerance = 0
The latent variables \(Z\), which capture the underlying structure of the data, are accessible via the latent_variables
attribute, or the .transform()
method:
= pln.latent_variables
Z = pln.transform()
Z print('Shape of Z:', Z.shape)
Shape of Z: torch.Size([400, 10])
You can visualize these latent variables using the .viz()
method:
=cell_type) pln.viz(colors
By default the effect of covariates on the latent variables is included in the visualization. This means that the latent variables are represented as \(Z +
XB\). The effect of covariates on the latent variables can be removed by using the remove_exog_effect
keyword:
= pln.transform(remove_exog_effect=True) Z_moins_XB
To visualize the latent positions without the effect of covariates (i.e., (Z - XB)), set the remove_exog_effect
parameter to True
in the .viz()
method:
=cell_type, remove_exog_effect=True) pln.viz(colors
Additionally, you can generate a pair plot of the first Principal Components (PCs) of the latent variables:
=4, colors=cell_type) pln.pca_pairplot(n_components
The remove_exog_effect
parameter is also available in the pca_pairplot
method.
The model provides insights into the effects of covariates. For example, it may reveal that the mean value for Macrophages is higher compared to T_cells_CD4+ and T_cells_CD8+.
To summarize the model, including confidence intervals and p-values, use the summary
method:
pln.summary()
Coefficients and p-values per dimension:
Dimension: FTL
Exog Name Coefficient P-value
Intercept 4.408599 1e-16
labels[T.T_cells_CD4+] -3.255301 1e-16
labels[T.T_cells_CD8+] -3.704965 1e-16
Dimension: MALAT1
Exog Name Coefficient P-value
Intercept 4.496646 1e-16
labels[T.T_cells_CD4+] 0.044992 0.75
labels[T.T_cells_CD8+] -0.602403 7.8e-08
Dimension: FTH1
Exog Name Coefficient P-value
Intercept 3.989528 1e-16
labels[T.T_cells_CD4+] -2.125104 1e-16
labels[T.T_cells_CD8+] -3.118913 1e-16
Dimension: TMSB4X
Exog Name Coefficient P-value
Intercept 4.338097 1e-16
labels[T.T_cells_CD4+] -0.861660 6.8e-05
labels[T.T_cells_CD8+] -1.246957 1e-16
Dimension: CD74
Exog Name Coefficient P-value
Intercept 3.903079 1e-16
labels[T.T_cells_CD4+] -3.282655 1e-16
labels[T.T_cells_CD8+] -3.335481 1e-16
Dimension: SPP1
Exog Name Coefficient P-value
Intercept 0.209591 0.19
labels[T.T_cells_CD4+] -4.515602 1e-16
labels[T.T_cells_CD8+] -5.166453 1e-16
Dimension: APOE
Exog Name Coefficient P-value
Intercept 0.926648 2e-11
labels[T.T_cells_CD4+] -4.293190 2.2e-16
labels[T.T_cells_CD8+] -4.383797 1e-16
Dimension: B2M
Exog Name Coefficient P-value
Intercept 3.912995 1e-16
labels[T.T_cells_CD4+] -0.220007 0.67
labels[T.T_cells_CD8+] -0.583185 0.00023
Dimension: ACTB
Exog Name Coefficient P-value
Intercept 3.496265 1e-16
labels[T.T_cells_CD4+] -1.082953 0.00027
labels[T.T_cells_CD8+] -1.466583 3.5e-09
Dimension: HLA-DRA
Exog Name Coefficient P-value
Intercept 3.575849 1e-16
labels[T.T_cells_CD4+] -4.725000 1e-16
labels[T.T_cells_CD8+] -4.348214 1e-16
The p-value corresponds to the coding used in one-hot encoding, with Macrophages
set as the reference category.
You can also visualize confidence intervals for regression coefficients using the plot_regression_forest
method:
pln.plot_regression_forest()
The pyPLNmodels
package provides tools to analyze the effects of variables and visualize their relationships using the plot_correlation_circle()
and biplot()
methods.
Use the plot_correlation_circle()
method to visualize the relationships between variables:
=["FTL", "MALAT1", "FTH1"]) pln.plot_correlation_circle(column_names
You can access the column names using the column_names_endog
attribute of the model.
The biplot()
method allows simultaneous visualization of variables and latent variables ((Z)):
=["FTL", "MALAT1", "FTH1"], colors=cell_type) pln.biplot(column_names
PlnPCA
ModelThe PlnPCA
model (Chiquet, Mariadassou, and Robin 2018) extends the functionality of the Pln
model for high-dimensional data. It uses the same syntax as Pln
:
from pyPLNmodels import PlnPCA
= load_scrna(dim=500)
high_d_rna = PlnPCA.from_formula('endog ~ 1 + labels', data=high_d_rna, rank=5).fit() pca
Returning scRNA dataset of size (400, 500)
Setting the offsets to zero.
Fitting a PlnPCA model with 5 principal components.
Intializing parameters ...
Initialization finished.
Upper bound on the fitting time: 0%| | 0/400 [00:00<?, ?it/s]Upper bound on the fitting time: 12%|█▎ | 50/400 [00:00<00:00, 497.67it/s]Upper bound on the fitting time: 25%|██▌ | 100/400 [00:00<00:00, 480.72it/s]Upper bound on the fitting time: 37%|███▋ | 149/400 [00:00<00:00, 465.31it/s]Upper bound on the fitting time: 50%|████▉ | 198/400 [00:00<00:00, 472.47it/s]Upper bound on the fitting time: 62%|██████▎ | 250/400 [00:00<00:00, 487.23it/s]Upper bound on the fitting time: 76%|███████▌ | 302/400 [00:00<00:00, 497.51it/s]Upper bound on the fitting time: 89%|████████▉ | 355/400 [00:00<00:00, 505.50it/s]Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 495.90it/s]
Maximum number of iterations (400) reached in 1.0 seconds.
Last criterion = 5.4e-06 . Required tolerance = 1e-06
⚠️ Note: P-values are not available in the PlnPCA
model.
print(pca)
A multivariate PlnPCA with 5 principal components.
======================================================================
Loglike Dimension Nb param BIC AIC ICL
-330657.75 500 3990 342610.7218 334647.75 339757.06
======================================================================
* Useful attributes
.latent_variables .latent_positions .coef .covariance .precision .model_parameters .latent_parameters .optim_details
* Useful methods
.transform() .show() .predict() .sigma() .projected_latent_variables() .plot_correlation_circle() .biplot() .viz() .pca_pairplot() .plot_expected_vs_true()
* Additional attributes for PlnPCA are:
None
* Additional methods for PlnPCA are:
None
A low-dimensional of dimension rank
of the latent variables can be obtained using the project
keyword of the .transform()
method:
= pca.transform(project=True)
Z_low_dim print('Shape of Z_low_dim:', Z_low_dim.shape)
Shape of Z_low_dim: torch.Size([400, 5])
This model is particularly efficient for high-dimensional datasets, offering significantly reduced computation time compared to Pln
. See this paper for a computational comparison between Pln
and PlnPCA
The rank is a hyperparameter that can be specified by the user. Alternatively, a data-driven approach to rank selection can be performed using the PlnPCACollection
class, which fits multiple models with different ranks:
from pyPLNmodels import PlnPCACollection
= PlnPCACollection.from_formula('endog ~ 1 + labels', data=high_d_rna, ranks=[3, 5, 10, 15]).fit() pcas
Setting the offsets to zero.
Adjusting 4 PlnPCA models.
Fitting a PlnPCA model with 3 principal components.
Intializing parameters ...
Initialization finished.
Upper bound on the fitting time: 0%| | 0/400 [00:00<?, ?it/s]Upper bound on the fitting time: 13%|█▎ | 51/400 [00:00<00:00, 505.80it/s]Upper bound on the fitting time: 26%|██▋ | 106/400 [00:00<00:00, 529.37it/s]Upper bound on the fitting time: 40%|████ | 161/400 [00:00<00:00, 537.43it/s]Upper bound on the fitting time: 54%|█████▍ | 216/400 [00:00<00:00, 541.60it/s]Upper bound on the fitting time: 68%|██████▊ | 272/400 [00:00<00:00, 547.74it/s]Upper bound on the fitting time: 82%|████████▏ | 328/400 [00:00<00:00, 550.41it/s]Upper bound on the fitting time: 96%|█████████▌| 384/400 [00:00<00:00, 549.51it/s]Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 543.79it/s]
Maximum number of iterations (400) reached in 0.9 seconds.
Last criterion = 5.4e-06 . Required tolerance = 1e-06
Fitting a PlnPCA model with 5 principal components.
Intializing parameters ...
Initialization finished.
Upper bound on the fitting time: 0%| | 0/400 [00:00<?, ?it/s]Upper bound on the fitting time: 13%|█▎ | 52/400 [00:00<00:00, 517.32it/s]Upper bound on the fitting time: 26%|██▌ | 104/400 [00:00<00:00, 503.69it/s]Upper bound on the fitting time: 39%|███▉ | 155/400 [00:00<00:00, 493.49it/s]Upper bound on the fitting time: 51%|█████▏ | 205/400 [00:00<00:00, 467.65it/s]Upper bound on the fitting time: 63%|██████▎ | 252/400 [00:00<00:00, 462.31it/s]Upper bound on the fitting time: 75%|███████▍ | 299/400 [00:00<00:00, 463.64it/s]Upper bound on the fitting time: 87%|████████▋ | 348/400 [00:00<00:00, 471.12it/s]Upper bound on the fitting time: 100%|█████████▉| 399/400 [00:00<00:00, 483.13it/s]Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 478.59it/s]
Maximum number of iterations (400) reached in 0.8 seconds.
Last criterion = 5.4e-06 . Required tolerance = 1e-06
Fitting a PlnPCA model with 10 principal components.
Intializing parameters ...
Initialization finished.
Upper bound on the fitting time: 0%| | 0/400 [00:00<?, ?it/s]Upper bound on the fitting time: 12%|█▏ | 48/400 [00:00<00:00, 478.78it/s]Upper bound on the fitting time: 24%|██▍ | 96/400 [00:00<00:00, 476.02it/s]Upper bound on the fitting time: 36%|███▌ | 144/400 [00:00<00:00, 472.61it/s]Upper bound on the fitting time: 48%|████▊ | 192/400 [00:00<00:00, 472.11it/s]Upper bound on the fitting time: 60%|██████ | 240/400 [00:00<00:00, 473.42it/s]Upper bound on the fitting time: 72%|███████▏ | 288/400 [00:00<00:00, 473.25it/s]Upper bound on the fitting time: 84%|████████▍ | 336/400 [00:00<00:00, 469.16it/s]Upper bound on the fitting time: 96%|█████████▌| 384/400 [00:00<00:00, 472.00it/s]Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 471.98it/s]
Maximum number of iterations (400) reached in 0.8 seconds.
Last criterion = 5.4e-06 . Required tolerance = 1e-06
Fitting a PlnPCA model with 15 principal components.
Intializing parameters ...
Initialization finished.
Upper bound on the fitting time: 0%| | 0/400 [00:00<?, ?it/s]Upper bound on the fitting time: 12%|█▎ | 50/400 [00:00<00:00, 495.53it/s]Upper bound on the fitting time: 25%|██▌ | 100/400 [00:00<00:00, 493.94it/s]Upper bound on the fitting time: 38%|███▊ | 150/400 [00:00<00:00, 494.76it/s]Upper bound on the fitting time: 50%|█████ | 201/400 [00:00<00:00, 498.12it/s]Upper bound on the fitting time: 63%|██████▎ | 252/400 [00:00<00:00, 500.57it/s]Upper bound on the fitting time: 76%|███████▌ | 303/400 [00:00<00:00, 500.40it/s]Upper bound on the fitting time: 88%|████████▊ | 354/400 [00:00<00:00, 480.49it/s]Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 487.86it/s]
Maximum number of iterations (400) reached in 0.8 seconds.
Last criterion = 5.4e-06 . Required tolerance = 1e-06
======================================================================
DONE!
Best model (lower BIC): rank 15
Best model(lower AIC): rank 15
======================================================================
Use the .show()
method to explore insights about the optimal rank:
pcas.show()
⚠️ Note: The best model typically correspond to the largest rank, which may not always be desirable.
The best model can be accessed using the .best_model()
method. The selection criterion can be AIC
, BIC
, or ICL
:
= pcas.best_model(criterion="BIC")
best_model print(best_model)
A multivariate PlnPCA with 15 principal components.
======================================================================
Loglike Dimension Nb param BIC AIC ICL
-281405.0 500 8895 308052.0386 290300.0 299389.87
======================================================================
* Useful attributes
.latent_variables .latent_positions .coef .covariance .precision .model_parameters .latent_parameters .optim_details
* Useful methods
.transform() .show() .predict() .sigma() .projected_latent_variables() .plot_correlation_circle() .biplot() .viz() .pca_pairplot() .plot_expected_vs_true()
* Additional attributes for PlnPCA are:
None
* Additional methods for PlnPCA are:
None
All individual models in the collection can be accessed with the rank as the key:
= pcas[5]
pca_5 print(pca_5)
A multivariate PlnPCA with 5 principal components.
======================================================================
Loglike Dimension Nb param BIC AIC ICL
-330615.38 500 3990 342568.3468 334605.375 339716.13
======================================================================
* Useful attributes
.latent_variables .latent_positions .coef .covariance .precision .model_parameters .latent_parameters .optim_details
* Useful methods
.transform() .show() .predict() .sigma() .projected_latent_variables() .plot_correlation_circle() .biplot() .viz() .pca_pairplot() .plot_expected_vs_true()
* Additional attributes for PlnPCA are:
None
* Additional methods for PlnPCA are:
None
You can also iterate through the collection:
for pca in pcas.values():
print(pca)
A multivariate PlnPCA with 3 principal components.
======================================================================
Loglike Dimension Nb param BIC AIC ICL
-363016.88 500 2997 371995.0846 366013.875 370287.37
======================================================================
* Useful attributes
.latent_variables .latent_positions .coef .covariance .precision .model_parameters .latent_parameters .optim_details
* Useful methods
.transform() .show() .predict() .sigma() .projected_latent_variables() .plot_correlation_circle() .biplot() .viz() .pca_pairplot() .plot_expected_vs_true()
* Additional attributes for PlnPCA are:
None
* Additional methods for PlnPCA are:
None
A multivariate PlnPCA with 5 principal components.
======================================================================
Loglike Dimension Nb param BIC AIC ICL
-330615.38 500 3990 342568.3468 334605.375 339716.13
======================================================================
* Useful attributes
.latent_variables .latent_positions .coef .covariance .precision .model_parameters .latent_parameters .optim_details
* Useful methods
.transform() .show() .predict() .sigma() .projected_latent_variables() .plot_correlation_circle() .biplot() .viz() .pca_pairplot() .plot_expected_vs_true()
* Additional attributes for PlnPCA are:
None
* Additional methods for PlnPCA are:
None
A multivariate PlnPCA with 10 principal components.
======================================================================
Loglike Dimension Nb param BIC AIC ICL
-295788.5 500 6455 315125.9518 302243.5 309395.58
======================================================================
* Useful attributes
.latent_variables .latent_positions .coef .covariance .precision .model_parameters .latent_parameters .optim_details
* Useful methods
.transform() .show() .predict() .sigma() .projected_latent_variables() .plot_correlation_circle() .biplot() .viz() .pca_pairplot() .plot_expected_vs_true()
* Additional attributes for PlnPCA are:
None
* Additional methods for PlnPCA are:
None
A multivariate PlnPCA with 15 principal components.
======================================================================
Loglike Dimension Nb param BIC AIC ICL
-281405.0 500 8895 308052.0386 290300.0 299389.87
======================================================================
* Useful attributes
.latent_variables .latent_positions .coef .covariance .precision .model_parameters .latent_parameters .optim_details
* Useful methods
.transform() .show() .predict() .sigma() .projected_latent_variables() .plot_correlation_circle() .biplot() .viz() .pca_pairplot() .plot_expected_vs_true()
* Additional attributes for PlnPCA are:
None
* Additional methods for PlnPCA are:
None
For further details, print the model to explore its methods and attributes.