Network inference 🌐

Published

March 27, 2025

1 Introduction

The PlnNetwork model in the pyPLNmodels package is designed for sparse network inference from multivariate count data, using a penalized precision matrix. It learns the network structure between variables, such as genes in single-cell RNA sequencing datasets. It can also induce sparsity on the coefficients of the regression matrix.

This tutorial demonstrates how to:

  • Initialize a PlnNetwork (documentation) model from count data or using a formula interface

  • Fit the model with regularization on the precision matrix

  • Visualize inferred network structures and latent variables

  • Select the optimal penalty using the dedicated model-selection tools

  • Induce sparsity on the regression matrix

1.1 Statistical background

Compared to the Pln model, a sparsity-inducing constraint is imposed on the precision matrix \(\Sigma^{-1}\) of the latent space (Chiquet, Robin, and Mariadassou 2019):

\[ \begin{align} Z_i &\sim \mathcal{N}(X_i^{\top} B, \Sigma), \quad \|\Sigma^{-1}\|_1 \leq C \\ Y_{ij} \mid Z_{ij} &\sim \mathcal{P}(\exp(o_{ij} + Z_{ij})), \end{align} \] where \(C\) is a penalty parameter. See Chiquet, Robin, and Mariadassou (2019) for more details.
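
To make the generative hierarchy above concrete, here is a minimal NumPy simulation sketch with a hand-built sparse (tridiagonal) precision matrix; it is purely illustrative, uses arbitrary values, and does not involve pyPLNmodels.

import numpy as np

rng = np.random.default_rng(0)
n, p, d = 100, 5, 1  # samples, variables, covariates

# Hand-built sparse precision matrix: zeros encode absent edges.
precision = np.eye(p) + 0.4 * np.eye(p, k=1) + 0.4 * np.eye(p, k=-1)
Sigma = np.linalg.inv(precision)

X = np.ones((n, d))            # intercept-only design
B = rng.normal(size=(d, p))    # regression coefficients
O = np.zeros((n, p))           # offsets

# Z_i ~ N(X_i^T B, Sigma), then Y_ij | Z_ij ~ Poisson(exp(o_ij + Z_ij))
Z = X @ B + rng.multivariate_normal(np.zeros(p), Sigma, size=n)
Y = rng.poisson(np.exp(O + Z))
print(Y.shape)  # (100, 5)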

1.2 Data importation

For visualization purposes, we use only 20 variables (i.e. genes).

from pyPLNmodels import load_scrna
data = load_scrna(dim=20)
endog = data["endog"]
Returning scRNA dataset of size (400, 20)
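
Before fitting, a quick sanity check of what load_scrna returns; this sketch assumes endog is a pandas DataFrame with one column per gene (as the gene names appearing later suggest) and that the dictionary also holds the cell-type labels used further below.

print(endog.shape)              # (400, 20)
print(list(endog.columns)[:5])  # first gene names, assuming a DataFrame
print(data["labels"][:5])       # cell-type labels, used later in the formula interface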

2 Model initialization and fitting

A penalty needs to be specified for the model: the greater the penalty, the sparser the precision matrix and the fewer the links between variables. We use a value of \(200\) for now.

from pyPLNmodels import PlnNetwork
net = PlnNetwork(endog, penalty=200).fit()
Setting the offsets to zero.
Fitting a PlnNetwork model with penalty 200 on the precision matrix and lasso  penalty 0 on the regression coefficients.
Intializing parameters ...
Initialization finished.
Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 694.52it/s]
Maximum number of iterations (400)  reached in 1.4 seconds.
Last  criterion = 5.4e-06 . Required tolerance = 1e-06
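
The effect of the penalty can be checked directly on the fitted object, using the network and nb_zeros_precision attributes listed in the model summaries further below.

# Number of exact zeros in the estimated precision matrix.
print("Zeros in the precision matrix:", net.nb_zeros_precision)

# Each edge appears twice in the network dictionary (once per endpoint).
nb_edges = sum(len(neighbors) for neighbors in net.network.values()) // 2
print("Number of edges:", nb_edges)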

3 Network visualization

net.viz_network()

The network can be accessed as a dictionary through the network attribute:

network = net.network
print(network)
print("Genes associated with APOE: ", network["APOE"])
{'FTL': ['FTH1', 'SPP1', 'APOE', 'B2M', 'HLA-DRA', 'APOC1', 'TMSB10', 'HLA-DRB1', 'HLA-DPA1', 'HLA-DPB1', 'LYZ'], 'MALAT1': ['APOE', 'HLA-DRA', 'APOC1', 'EEF1A1'], 'FTH1': ['FTL', 'TMSB4X', 'APOE', 'B2M', 'ACTB', 'HLA-DRA', 'APOC1', 'TMSB10', 'EEF1A1', 'HLA-DPA1', 'HLA-DPB1', 'LYZ', 'RPLP1'], 'TMSB4X': ['FTH1', 'ACTB', 'HLA-DRA', 'EEF1A1', 'HLA-DRB1', 'HLA-DPA1', 'MT-CO2', 'LYZ', 'RPLP1'], 'CD74': ['SPP1', 'APOE', 'ACTB', 'HLA-DRA', 'MT-CO1', 'TMSB10', 'HLA-DRB1', 'HLA-DPA1', 'HLA-DPB1', 'LYZ', 'RPLP1'], 'SPP1': ['FTL', 'CD74', 'APOE', 'HLA-DRA', 'APOC1', 'LYZ'], 'APOE': ['FTL', 'MALAT1', 'FTH1', 'CD74', 'SPP1', 'HLA-DRA', 'APOC1', 'TMSB10', 'EEF1A1', 'HLA-DRB1', 'HLA-DPA1', 'HLA-DPB1', 'LYZ'], 'B2M': ['FTL', 'FTH1', 'HLA-DRA', 'TMSB10', 'HLA-DRB1'], 'ACTB': ['FTH1', 'TMSB4X', 'CD74', 'HLA-DRA', 'MT-CO1', 'EEF1A1', 'HLA-DRB1', 'HLA-DPA1', 'RPLP1'], 'HLA-DRA': ['FTL', 'MALAT1', 'FTH1', 'TMSB4X', 'CD74', 'SPP1', 'APOE', 'B2M', 'ACTB', 'APOC1', 'MT-CO1', 'TMSB10', 'EEF1A1', 'HLA-DRB1', 'HLA-DPA1', 'HLA-DPB1', 'MT-CO2', 'LYZ', 'RPLP1'], 'APOC1': ['FTL', 'MALAT1', 'FTH1', 'SPP1', 'APOE', 'HLA-DRA', 'MT-CO1', 'HLA-DRB1', 'HLA-DPB1', 'LYZ'], 'MT-CO1': ['CD74', 'ACTB', 'HLA-DRA', 'APOC1', 'TMSB10', 'EEF1A1', 'HLA-DPB1', 'LYZ'], 'TMSB10': ['FTL', 'FTH1', 'CD74', 'APOE', 'B2M', 'HLA-DRA', 'MT-CO1', 'EEF1A1', 'HLA-DRB1', 'HLA-DPB1', 'LYZ', 'RPLP1'], 'EEF1A1': ['MALAT1', 'FTH1', 'TMSB4X', 'APOE', 'ACTB', 'HLA-DRA', 'MT-CO1', 'TMSB10', 'HLA-DPA1', 'HLA-DPB1', 'RPLP1'], 'HLA-DRB1': ['FTL', 'TMSB4X', 'CD74', 'APOE', 'B2M', 'ACTB', 'HLA-DRA', 'APOC1', 'TMSB10', 'HLA-DPA1', 'HLA-DPB1', 'LYZ'], 'HLA-DPA1': ['FTL', 'FTH1', 'TMSB4X', 'CD74', 'APOE', 'ACTB', 'HLA-DRA', 'EEF1A1', 'HLA-DRB1', 'HLA-DPB1', 'MT-CO2', 'LYZ'], 'HLA-DPB1': ['FTL', 'FTH1', 'CD74', 'APOE', 'HLA-DRA', 'APOC1', 'MT-CO1', 'TMSB10', 'EEF1A1', 'HLA-DRB1', 'HLA-DPA1', 'LYZ', 'RPLP1'], 'MT-CO2': ['TMSB4X', 'HLA-DRA', 'HLA-DPA1'], 'LYZ': ['FTL', 'FTH1', 'TMSB4X', 'CD74', 'SPP1', 'APOE', 'HLA-DRA', 'APOC1', 'MT-CO1', 'TMSB10', 'HLA-DRB1', 'HLA-DPA1', 'HLA-DPB1', 'RPLP1'], 'RPLP1': ['FTH1', 'TMSB4X', 'CD74', 'ACTB', 'HLA-DRA', 'TMSB10', 'EEF1A1', 'HLA-DPB1', 'LYZ']}
Genes associated with APOE:  ['FTL', 'MALAT1', 'FTH1', 'CD74', 'SPP1', 'HLA-DRA', 'APOC1', 'TMSB10', 'EEF1A1', 'HLA-DRB1', 'HLA-DPA1', 'HLA-DPB1', 'LYZ']
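
If you prefer to control the layout yourself, the dictionary can be converted into a networkx graph; this is only a sketch assuming networkx and matplotlib are installed (viz_network already produces such a plot for you).

import networkx as nx
import matplotlib.pyplot as plt

# Build an undirected graph from the adjacency dictionary returned by .network.
G = nx.Graph()
for gene, neighbors in network.items():
    for other in neighbors:
        G.add_edge(gene, other)

nx.draw(G, with_labels=True, node_size=500, font_size=7)
plt.show()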

4 Use PlnNetworkCollection for selecting the optimal penalty

The penalty is a hyperparameter that needs to be tuned. To explore several penalty values at once, use PlnNetworkCollection:

from pyPLNmodels import PlnNetworkCollection
collection = PlnNetworkCollection(endog, penalties=[0.1, 10, 1000]).fit()
Setting the offsets to zero.
Adjusting 3 PlnNetwork models.

Fitting a PlnNetwork model with penalty 0.1 on the precision matrix and lasso  penalty 0 on the regression coefficients.
Intializing parameters ...
Initialization finished.
Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 667.32it/s]
Maximum number of iterations (400)  reached in 0.6 seconds.
Last  criterion = 5.4e-06 . Required tolerance = 1e-06
Fitting a PlnNetwork model with penalty 10 on the precision matrix and lasso  penalty 0 on the regression coefficients.
Intializing parameters ...
Initialization finished.
Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 654.62it/s]
Maximum number of iterations (400)  reached in 0.6 seconds.
Last  criterion = 5.4e-06 . Required tolerance = 1e-06
Fitting a PlnNetwork model with penalty 1000 on the precision matrix and lasso  penalty 0 on the regression coefficients.
Intializing parameters ...
Initialization finished.
Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 671.85it/s]
Maximum number of iterations (400)  reached in 0.6 seconds.
Last  criterion = 5.4e-06 . Required tolerance = 1e-06
======================================================================

DONE!
Best model (lower BIC): penalty 0.1
    Best model(lower AIC): penalty  0.1

======================================================================

4.1 Evaluate model quality

collection.show(figsize=(8, 8))

This displays the BIC, AIC, and log-likelihood criteria of each model, helping to identify the most suitable one. The number of links (i.e., non-zero off-diagonal elements of the precision matrix) is also reported. You may consider using the elbow method to select the optimal model based on the number of links.
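
For instance, the elbow heuristic can be applied by plotting the number of edges against the penalty; a minimal sketch, assuming matplotlib is available and counting edges from each model's network dictionary.

import matplotlib.pyplot as plt

penalties = [0.1, 10, 1000]
nb_edges = []
for penalty in penalties:
    model = collection[penalty]  # models are indexed by their penalty
    nb_edges.append(sum(len(v) for v in model.network.values()) // 2)

plt.plot(penalties, nb_edges, marker="o")
plt.xscale("log")
plt.xlabel("Penalty")
plt.ylabel("Number of edges")
plt.show()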

4.2 Selecting the best model

The best model can be chosen according to the BIC or AIC:

best_net = collection.best_model("BIC")  # Could also be "AIC".
print(best_net)
A multivariate PlnNetwork with penalty 0.1 on the precision matrix and lasso  penalty 0 on the regression coefficients.
======================================================================
     Loglike   Dimension    Nb param         BIC         AIC         ICL    Nb edges
   -22940.33          20         227  23620.3633   23167.332    10461.07         187

======================================================================
* Useful attributes
    .latent_variables .latent_positions .coef .covariance .precision .model_parameters .latent_parameters .optim_details
* Useful methods
    .transform() .show() .predict() .sigma() .projected_latent_variables() .plot_correlation_circle() .biplot() .viz() .pca_pairplot() .plot_expected_vs_true()
* Additional attributes for PlnNetwork are:
    .network .nb_zeros_precision
* Additional methods for PlnNetwork are:
    .viz_network()
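
The selected model can then be explored like any other PlnNetwork, for instance by displaying its latent variables with .viz() and its inferred graph with .viz_network(), both listed among the useful methods above (called here without optional arguments).

best_net.viz()          # latent variables of the best model
best_net.viz_network()  # network inferred with the selected penalty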

One can also access each individual model in the collection using its penalty as a key:

print(collection[10])
A multivariate PlnNetwork with penalty 10 on the precision matrix and lasso  penalty 0 on the regression coefficients.
======================================================================
     Loglike   Dimension    Nb param         BIC         AIC         ICL    Nb edges
   -23557.72          20         172  24072.9867  23729.7207    10589.69         132

======================================================================
* Useful attributes
    .latent_variables .latent_positions .coef .covariance .precision .model_parameters .latent_parameters .optim_details
* Useful methods
    .transform() .show() .predict() .sigma() .projected_latent_variables() .plot_correlation_circle() .biplot() .viz() .pca_pairplot() .plot_expected_vs_true()
* Additional attributes for PlnNetwork are:
    .network .nb_zeros_precision
* Additional methods for PlnNetwork are:
    .viz_network()

and loop through the collection:

for net in collection.values():
    print(net)
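
If the initial grid turns out to be too coarse, the search can be refined with a finer collection around the selected penalty; a sketch with arbitrary, log-spaced values between 0.1 and 10.

import numpy as np
from pyPLNmodels import PlnNetworkCollection

# Finer, log-spaced grid around the previously selected penalty (values are illustrative).
fine_penalties = np.logspace(-1, 1, 5).tolist()
fine_collection = PlnNetworkCollection(endog, penalties=fine_penalties).fit()
fine_collection.show(figsize=(8, 8))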

4.3 Inducing sparsity on the regression matrix

By default, PlnNetwork does not impose any sparsity on the regression matrix. To induce sparsity, set the penalty_coef parameter to a positive float when initializing the model:

net = PlnNetwork.from_formula("endog ~ 1 + labels", data=data, penalty_coef=20, penalty=200).fit()
net.show()
Setting the offsets to zero.
Fitting a PlnNetwork model with penalty 200 on the precision matrix and lasso  penalty 20 on the regression coefficients.
Intializing parameters ...
Initialization finished.
Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 668.17it/s]
Maximum number of iterations (400)  reached in 0.6 seconds.
Last  criterion = 5.4e-06 . Required tolerance = 1e-06
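
The amount of sparsity actually induced on the regression coefficients can be checked through the coef attribute listed in the model summaries; this sketch assumes coef is returned either as a torch tensor or as an array-like object.

import numpy as np
import torch

coef = net.coef
# .coef may be a torch tensor (assumption about the return type); convert to NumPy if so.
if isinstance(coef, torch.Tensor):
    coef = coef.detach().cpu().numpy()
else:
    coef = np.asarray(coef)

print("Zero coefficients:", int(np.sum(coef == 0)), "out of", coef.size)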

The default penalty on the regression coefficients is the lasso, but group_lasso and sparse_group_lasso penalties are also available. Select one by specifying the penalty_coef_type parameter when initializing the model:

net = PlnNetwork.from_formula("endog ~ 1 + labels", data=data, penalty_coef_type="group_lasso", penalty_coef=50, penalty=200).fit()
net.show()
Setting the offsets to zero.
Fitting a PlnNetwork model with penalty 200 on the precision matrix and group_lasso  penalty 50 on the regression coefficients.
Intializing parameters ...
Initialization finished.
Upper bound on the fitting time: 100%|██████████| 400/400 [00:00<00:00, 675.01it/s]
Maximum number of iterations (400)  reached in 0.6 seconds.
Last  criterion = 5.39e-06 . Required tolerance = 1e-06

The coefficients of the regression matrix are therefore sparse.
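
The same pattern applies to the sparse group lasso mentioned above; a sketch with arbitrary penalty values.

net_sgl = PlnNetwork.from_formula(
    "endog ~ 1 + labels",
    data=data,
    penalty_coef_type="sparse_group_lasso",
    penalty_coef=50,
    penalty=200,
).fit()
net_sgl.show()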

References

Batardière, Bastien, Joon Kwon, and Julien Chiquet. 2024. “pyPLNmodels: A Python Package to Analyze Multivariate High-Dimensional Count Data.” Journal of Open Source Software.
Chiquet, Julien, Stephane Robin, and Mahendra Mariadassou. 2019. “Variational Inference for Sparse Network Reconstruction from Count Data.” In Proceedings of the 36th International Conference on Machine Learning, 97:1162–71. Proceedings of Machine Learning Research. PMLR.