The PlnNetwork model in the pyPLNmodels package is designed for sparse network inference from multivariate count data, using a penalized precision matrix. It allows for network structure learning between variables, such as genes in single-cell RNA sequencing datasets. Moreover, it can induce sparsity on the coefficients of the regression matrix.
This tutorial demonstrates how to:
Initialize a PlnNetwork (documentation) model from count data or using a formula interface
Fit the model with regularization on the precision matrix
Visualize inferred network structures and latent variables
Select the optimal penalty
Induce sparsity on the regression matrix
Compared to the Pln model, a constraint is imposed on the precision matrix \(\Sigma^{-1}\) of the latent space to enforce sparsity (Chiquet, Robin, and Mariadassou (2019)):
\[ \begin{align} Z_i &\sim \mathcal{N}(X_i^{\top} B, \Sigma), \quad \|\Sigma^{-1}\|_1 \leq C \\ Y_{ij} \mid Z_{ij} &\sim \mathcal{P}(\exp(o_{ij} + Z_{ij})), \end{align} \] where \(C\) is a penalty parameter. See Chiquet, Robin, and Mariadassou (2019) for more details.
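To make the generative model concrete, here is a minimal simulation sketch in plain NumPy. It is independent of pyPLNmodels, ignores the sparsity constraint on \(\Sigma^{-1}\), and all names and dimensions are chosen purely for illustration:
import numpy as np

rng = np.random.default_rng(0)
n, p, d = 400, 20, 1                 # samples, variables, covariates
X = np.ones((n, d))                  # intercept-only design matrix
B = rng.normal(size=(d, p))          # regression coefficients
Sigma = np.eye(p)                    # latent covariance; its inverse is what PlnNetwork penalizes
Z = X @ B + rng.multivariate_normal(np.zeros(p), Sigma, size=n)  # latent Gaussian layer
Y = rng.poisson(np.exp(Z))           # observed counts, with offsets o_ij set to zero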
For visualization purposes, we use only 20 variables (i.e. genes).
from pyPLNmodels import load_scrna

data = load_scrna(dim=20)
endog = data["endog"]
Returning scRNA dataset of size (400, 20)
A penalty needs to be specified for the model. The greater the penalty, the sparser the precision matrix will be and the lower the number of linked variables. We use a value of \(200\) for now.
from pyPLNmodels import PlnNetwork

net = PlnNetwork(endog, penalty=200).fit()
Setting the offsets to zero.
Fitting a PlnNetwork model with penalty 200 on the precision matrix and lasso penalty 0 on the regression coefficients.
Intializing parameters ...
Initialization finished.
Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 694.52it/s]
Maximum number of iterations (400) reached in 1.4 seconds.
Last criterion = 5.4e-06 . Required tolerance = 1e-06
net.viz_network()
The network can be accessed as a dictionary through the network attribute:
network = net.network
print(network)
print("Genes associated with APOE: ", network["APOE"])
{'FTL': ['FTH1', 'SPP1', 'APOE', 'B2M', 'HLA-DRA', 'APOC1', 'TMSB10', 'HLA-DRB1', 'HLA-DPA1', 'HLA-DPB1', 'LYZ'], 'MALAT1': ['APOE', 'HLA-DRA', 'APOC1', 'EEF1A1'], 'FTH1': ['FTL', 'TMSB4X', 'APOE', 'B2M', 'ACTB', 'HLA-DRA', 'APOC1', 'TMSB10', 'EEF1A1', 'HLA-DPA1', 'HLA-DPB1', 'LYZ', 'RPLP1'], 'TMSB4X': ['FTH1', 'ACTB', 'HLA-DRA', 'EEF1A1', 'HLA-DRB1', 'HLA-DPA1', 'MT-CO2', 'LYZ', 'RPLP1'], 'CD74': ['SPP1', 'APOE', 'ACTB', 'HLA-DRA', 'MT-CO1', 'TMSB10', 'HLA-DRB1', 'HLA-DPA1', 'HLA-DPB1', 'LYZ', 'RPLP1'], 'SPP1': ['FTL', 'CD74', 'APOE', 'HLA-DRA', 'APOC1', 'LYZ'], 'APOE': ['FTL', 'MALAT1', 'FTH1', 'CD74', 'SPP1', 'HLA-DRA', 'APOC1', 'TMSB10', 'EEF1A1', 'HLA-DRB1', 'HLA-DPA1', 'HLA-DPB1', 'LYZ'], 'B2M': ['FTL', 'FTH1', 'HLA-DRA', 'TMSB10', 'HLA-DRB1'], 'ACTB': ['FTH1', 'TMSB4X', 'CD74', 'HLA-DRA', 'MT-CO1', 'EEF1A1', 'HLA-DRB1', 'HLA-DPA1', 'RPLP1'], 'HLA-DRA': ['FTL', 'MALAT1', 'FTH1', 'TMSB4X', 'CD74', 'SPP1', 'APOE', 'B2M', 'ACTB', 'APOC1', 'MT-CO1', 'TMSB10', 'EEF1A1', 'HLA-DRB1', 'HLA-DPA1', 'HLA-DPB1', 'MT-CO2', 'LYZ', 'RPLP1'], 'APOC1': ['FTL', 'MALAT1', 'FTH1', 'SPP1', 'APOE', 'HLA-DRA', 'MT-CO1', 'HLA-DRB1', 'HLA-DPB1', 'LYZ'], 'MT-CO1': ['CD74', 'ACTB', 'HLA-DRA', 'APOC1', 'TMSB10', 'EEF1A1', 'HLA-DPB1', 'LYZ'], 'TMSB10': ['FTL', 'FTH1', 'CD74', 'APOE', 'B2M', 'HLA-DRA', 'MT-CO1', 'EEF1A1', 'HLA-DRB1', 'HLA-DPB1', 'LYZ', 'RPLP1'], 'EEF1A1': ['MALAT1', 'FTH1', 'TMSB4X', 'APOE', 'ACTB', 'HLA-DRA', 'MT-CO1', 'TMSB10', 'HLA-DPA1', 'HLA-DPB1', 'RPLP1'], 'HLA-DRB1': ['FTL', 'TMSB4X', 'CD74', 'APOE', 'B2M', 'ACTB', 'HLA-DRA', 'APOC1', 'TMSB10', 'HLA-DPA1', 'HLA-DPB1', 'LYZ'], 'HLA-DPA1': ['FTL', 'FTH1', 'TMSB4X', 'CD74', 'APOE', 'ACTB', 'HLA-DRA', 'EEF1A1', 'HLA-DRB1', 'HLA-DPB1', 'MT-CO2', 'LYZ'], 'HLA-DPB1': ['FTL', 'FTH1', 'CD74', 'APOE', 'HLA-DRA', 'APOC1', 'MT-CO1', 'TMSB10', 'EEF1A1', 'HLA-DRB1', 'HLA-DPA1', 'LYZ', 'RPLP1'], 'MT-CO2': ['TMSB4X', 'HLA-DRA', 'HLA-DPA1'], 'LYZ': ['FTL', 'FTH1', 'TMSB4X', 'CD74', 'SPP1', 'APOE', 'HLA-DRA', 'APOC1', 'MT-CO1', 'TMSB10', 'HLA-DRB1', 'HLA-DPA1', 'HLA-DPB1', 'RPLP1'], 'RPLP1': ['FTH1', 'TMSB4X', 'CD74', 'ACTB', 'HLA-DRA', 'TMSB10', 'EEF1A1', 'HLA-DPB1', 'LYZ']}
Genes associated with APOE: ['FTL', 'MALAT1', 'FTH1', 'CD74', 'SPP1', 'HLA-DRA', 'APOC1', 'TMSB10', 'EEF1A1', 'HLA-DRB1', 'HLA-DPA1', 'HLA-DPB1', 'LYZ']
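Since the network is a plain dictionary mapping each gene to its linked genes, it can be handed to standard graph tooling for further analysis. The following is only an illustrative sketch; networkx is not part of pyPLNmodels and is assumed here:
import networkx as nx

# Build an undirected graph from the adjacency dictionary; the symmetric
# (gene, neighbor) pairs collapse into a single edge each.
graph = nx.Graph()
graph.add_edges_from(
    (gene, neighbor)
    for gene, neighbors in network.items()
    for neighbor in neighbors
)
print(graph.number_of_nodes(), graph.number_of_edges())
print(sorted(graph.neighbors("APOE")))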
PlnNetworkCollection for selecting the optimal penalty
The penalty is a hyperparameter that needs to be tuned. To explore multiple penalty solutions, use PlnNetworkCollection:
from pyPLNmodels import PlnNetworkCollection

collection = PlnNetworkCollection(endog, penalties=[0.1, 10, 1000]).fit()
Setting the offsets to zero.
Adjusting 3 PlnNetwork models.
Fitting a PlnNetwork model with penalty 0.1 on the precision matrix and lasso penalty 0 on the regression coefficients.
Intializing parameters ...
Initialization finished.
Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 667.32it/s]
Maximum number of iterations (400) reached in 0.6 seconds.
Last criterion = 5.4e-06 . Required tolerance = 1e-06
Fitting a PlnNetwork model with penalty 10 on the precision matrix and lasso penalty 0 on the regression coefficients.
Intializing parameters ...
Initialization finished.
Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 654.62it/s]
Maximum number of iterations (400) reached in 0.6 seconds.
Last criterion = 5.4e-06 . Required tolerance = 1e-06
Fitting a PlnNetwork model with penalty 1000 on the precision matrix and lasso penalty 0 on the regression coefficients.
Intializing parameters ...
Initialization finished.
Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 671.85it/s]
Maximum number of iterations (400) reached in 0.6 seconds.
Last criterion = 5.4e-06 . Required tolerance = 1e-06
======================================================================
DONE!
Best model (lower BIC): penalty 0.1
Best model(lower AIC): penalty 0.1
======================================================================
collection.show(figsize=(8, 8))
This displays the BIC, AIC, and log-likelihood criteria for each model, helping to identify the most suitable one. Additionally, the number of links (i.e., non-zero elements in the precision matrix) is provided. You may consider using the elbow method to select the optimal model based on the number of links.
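As a rough illustration of such an elbow diagnostic, one can plot the number of edges against the penalty. This is only a sketch: it assumes collection.values() iterates the models in the same order as the penalties given at construction, and it counts edges from each model's network dictionary:
import matplotlib.pyplot as plt

penalties = [0.1, 10, 1000]
# Each edge appears twice in the adjacency dictionary (once per endpoint).
nb_edges = [
    sum(len(neighbors) for neighbors in model.network.values()) // 2
    for model in collection.values()
]
plt.plot(penalties, nb_edges, marker="o")
plt.xscale("log")
plt.xlabel("Penalty")
plt.ylabel("Number of edges")
plt.show()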
The best model can be chosen according to the BIC or AIC:
best_net = collection.best_model("BIC")  # Could also be "AIC".
print(best_net)
A multivariate PlnNetwork with penalty 0.1 on the precision matrix and lasso penalty 0 on the regression coefficients.
======================================================================
Loglike Dimension Nb param BIC AIC ICL Nb edges
-22940.33 20 227 23620.3633 23167.332 10461.07 187
======================================================================
* Useful attributes
.latent_variables .latent_positions .coef .covariance .precision .model_parameters .latent_parameters .optim_details
* Useful methods
.transform() .show() .predict() .sigma() .projected_latent_variables() .plot_correlation_circle() .biplot() .viz() .pca_pairplot() .plot_expected_vs_true()
* Additional attributes for PlnNetwork are:
.network .nb_zeros_precision
* Additional methods for PlnNetwork are:
.viz_network()
One can also access each individual model in the collection, using the penalty as a key:
print(collection[10])
A multivariate PlnNetwork with penalty 10 on the precision matrix and lasso penalty 0 on the regression coefficients.
======================================================================
Loglike Dimension Nb param BIC AIC ICL Nb edges
-23557.72 20 172 24072.9867 23729.7207 10589.69 132
======================================================================
* Useful attributes
.latent_variables .latent_positions .coef .covariance .precision .model_parameters .latent_parameters .optim_details
* Useful methods
.transform() .show() .predict() .sigma() .projected_latent_variables() .plot_correlation_circle() .biplot() .viz() .pca_pairplot() .plot_expected_vs_true()
* Additional attributes for PlnNetwork are:
.network .nb_zeros_precision
* Additional methods for PlnNetwork are:
.viz_network()
and loop through the collection:
for net in collection.values():
    print(net)
By default, PlnNetwork does not impose any sparsity on the regression matrix. To induce sparsity, set the penalty_coef parameter to a positive float when initializing the model:
net = PlnNetwork.from_formula("endog ~ 1 + labels", data=data, penalty_coef=20, penalty=200).fit()
net.show()
Setting the offsets to zero.
Fitting a PlnNetwork model with penalty 200 on the precision matrix and lasso penalty 20 on the regression coefficients.
Intializing parameters ...
Initialization finished.
Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 668.17it/s]
Maximum number of iterations (400) reached in 0.6 seconds.
Last criterion = 5.4e-06 . Required tolerance = 1e-06
The default sparsity is the lasso penalty, but you can also use the group_lasso and sparse_group_lasso penalties. This can be done by specifying the penalty_coef_type parameter when initializing the model:
net = PlnNetwork.from_formula("endog ~ 1 + labels", data=data, penalty_coef_type="group_lasso", penalty_coef=50, penalty=200).fit()
net.show()
Setting the offsets to zero.
Fitting a PlnNetwork model with penalty 200 on the precision matrix and group_lasso penalty 50 on the regression coefficients.
Intializing parameters ...
Initialization finished.
Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 675.01it/s]
Maximum number of iterations (400) reached in 0.6 seconds.
Last criterion = 5.39e-06 . Required tolerance = 1e-06
The coefficients of the regression matrix are therefore sparse.
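To check this, the .coef attribute listed among the useful attributes above can be inspected directly. A minimal sketch, assuming the coefficients can be converted to a NumPy array:
import numpy as np

# Count exact zeros in the fitted regression matrix.
coef = np.asarray(net.coef)
print(f"Proportion of zero regression coefficients: {np.mean(coef == 0):.2%}")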