from pyPLNmodels import load_scrna
= load_scrna()
rna print('Data: ', rna.keys())
Returning scRNA dataset of size (400, 100)
Data: dict_keys(['endog', 'labels', 'labels_1hot'])
pyPLNmodels
In this tutorial, we will explore the fundamentals of multivariate count data analysis using the pyPLNmodels
package. This package includes two primary models for this purpose:
Pln
model (documentation)PlnPCA
model (documentation)The PlnPCA
model performs a similar function to the Pln
model but is designed to handle high-dimensional data, albeit with a slight compromise in parameter estimation accuracy.
Given a count matrix \(Y\), each model assumes the following:
\[ Y_{ij}| Z_{ij} \sim \mathcal P(\exp(Z_{ij})), \quad Z_{i}\sim \mathcal N(o_i + X_i^{\top} B, \Sigma),\] with input data
endog
) is the \(j\)-th count for the \(i\)-th observationexog
in the package) covariates for the \(i\)-th observation (if available)offsets
) offset for the \(i\)-th observation (if available)and model parameters
\(B\) (denoted coef
in the package) is a matrix of regression coefficients
\(\Sigma\) (denoted covariance
in the package) is the covariance matrix of the latent variables \(Z_i\)
The Pln
model assumes that \(\Sigma\) has full rank, whereas the PlnPCA
model assumes that \(\Sigma\) has a low rank, which must be specified by the user. A lower rank results in a greater compromise in parameter estimation.
The role of the pyPLNmodels
package is to estimate the parameters \(B\) and \(\Sigma\) using the input matrix \(Y\) and, if available, the covariate matrix \(X\) and offsets \(O\).
In this tutorial, we will analyse single-cell RNA-seq data available in the load_scrna
function of the package. In this dataset, each column corresponds to a gene, and each row (i.e. individual) corresponds to a cell. Covariates are available, corresponding to the cell type (labels
):
from pyPLNmodels import load_scrna
= load_scrna()
rna print('Data: ', rna.keys())
Returning scRNA dataset of size (400, 100)
Data: dict_keys(['endog', 'labels', 'labels_1hot'])
= rna["endog"]
endog print(endog.head())
FTL MALAT1 FTH1 TMSB4X CD74 SPP1 APOE B2M ACTB HLA-DRA \
0 5.0 39.0 14.0 39.0 11.0 0.0 0.0 88.0 49.0 0.0
1 8.0 52.0 38.0 186.0 17.0 0.0 0.0 152.0 167.0 3.0
2 76.0 29.0 38.0 71.0 37.0 3.0 2.0 21.0 7.0 12.0
3 354.0 151.0 236.0 184.0 287.0 0.0 238.0 255.0 27.0 243.0
4 0.0 55.0 10.0 37.0 8.0 1.0 0.0 23.0 49.0 1.0
... RPS15 ACTG1 RPL3 TIMP1 RPL15 RPL30 RPL12 CCL3L1 S100A8 NPC2
0 ... 3.0 16.0 6.0 0.0 8.0 8.0 4.0 0.0 0.0 1.0
1 ... 67.0 28.0 43.0 2.0 50.0 52.0 56.0 0.0 0.0 1.0
2 ... 6.0 2.0 5.0 0.0 9.0 5.0 9.0 0.0 0.0 4.0
3 ... 22.0 16.0 28.0 1.0 32.0 24.0 30.0 0.0 0.0 20.0
4 ... 9.0 4.0 7.0 0.0 11.0 4.0 3.0 0.0 0.0 1.0
[5 rows x 100 columns]
= rna["labels"]
cell_type print(cell_type.head())
0 T_cells_CD4+
1 T_cells_CD4+
2 Macrophages
3 Macrophages
4 T_cells_CD4+
Name: standard_true_celltype_v5, dtype: object
Working fine.
Content for the first subsection.
Content for the second section.
Write your conclusion here.