Multivariate count data analysis with pyPLNmodels

Published

March 27, 2025

In this tutorial, we will explore the fundamentals of multivariate count data analysis using the pyPLNmodels package. This package includes two primary models for this purpose:

The PlnPCA model performs a similar function to the Pln model but is designed to handle high-dimensional data, albeit with a slight compromise in parameter estimation accuracy.

Statistical background

Given a count matrix \(Y\), each model assumes the following:

\[ Y_{ij}| Z_{ij} \sim \mathcal P(\exp(Z_{ij})), \quad Z_{i}\sim \mathcal N(o_i + X_i^{\top} B, \Sigma),\] with input data

  • \(Y_{ij}\) (denoted endog) is the \(j\)-th count for the \(i\)-th observation
  • \(X_i\) (denoted exog in the package) covariates for the \(i\)-th observation (if available)
  • \(o_i\) (denoted offsets) offset for the \(i\)-th observation (if available)

and model parameters

  • \(B\) (denoted coef in the package) is a matrix of regression coefficients

  • \(\Sigma\) (denoted covariance in the package) is the covariance matrix of the latent variables \(Z_i\)

    The Pln model assumes that \(\Sigma\) has full rank, whereas the PlnPCA model assumes that \(\Sigma\) has a low rank, which must be specified by the user. A lower rank results in a greater compromise in parameter estimation.

Role

The role of the pyPLNmodels package is to estimate the parameters \(B\) and \(\Sigma\) using the input matrix \(Y\) and, if available, the covariate matrix \(X\) and offsets \(O\).

Data importation

In this tutorial, we will analyse single-cell RNA-seq data available in the load_scrna function of the package. In this dataset, each column corresponds to a gene, and each row (i.e. individual) corresponds to a cell. Covariates are available, corresponding to the cell type (labels):

from pyPLNmodels import load_scrna
rna = load_scrna()
print('Data: ', rna.keys())
Returning scRNA dataset of size (400, 100)
Data:  dict_keys(['endog', 'labels', 'labels_1hot'])

Data structure

endog = rna["endog"]
print(endog.head())
     FTL  MALAT1   FTH1  TMSB4X   CD74  SPP1   APOE    B2M   ACTB  HLA-DRA  \
0    5.0    39.0   14.0    39.0   11.0   0.0    0.0   88.0   49.0      0.0   
1    8.0    52.0   38.0   186.0   17.0   0.0    0.0  152.0  167.0      3.0   
2   76.0    29.0   38.0    71.0   37.0   3.0    2.0   21.0    7.0     12.0   
3  354.0   151.0  236.0   184.0  287.0   0.0  238.0  255.0   27.0    243.0   
4    0.0    55.0   10.0    37.0    8.0   1.0    0.0   23.0   49.0      1.0   

   ...  RPS15  ACTG1  RPL3  TIMP1  RPL15  RPL30  RPL12  CCL3L1  S100A8  NPC2  
0  ...    3.0   16.0   6.0    0.0    8.0    8.0    4.0     0.0     0.0   1.0  
1  ...   67.0   28.0  43.0    2.0   50.0   52.0   56.0     0.0     0.0   1.0  
2  ...    6.0    2.0   5.0    0.0    9.0    5.0    9.0     0.0     0.0   4.0  
3  ...   22.0   16.0  28.0    1.0   32.0   24.0   30.0     0.0     0.0  20.0  
4  ...    9.0    4.0   7.0    0.0   11.0    4.0    3.0     0.0     0.0   1.0  

[5 rows x 100 columns]
cell_type = rna["labels"]
print(cell_type.head())
0    T_cells_CD4+
1    T_cells_CD4+
2     Macrophages
3     Macrophages
4    T_cells_CD4+
Name: standard_true_celltype_v5, dtype: object

Section 1

Working fine.

Subsection 1.1

Content for the first subsection.

Section 2

Content for the second section.

Conclusion

Write your conclusion here.