PlnMixture

The PlnMixture model clusters count data. After fitting the model with .fit(), the clusters are available through the .clusters attribute. Covariates can be used, but the regression coefficients are shared across all clusters. Performance may decrease significantly as the number of covariates grows. The number of clusters is a hyperparameter that must be set by the user.

For an in-depth tutorial on the PlnMixture model, see the clustering tutorial.

PlnMixture Documentation

class pyPLNmodels.PlnMixture(endog, n_cluster, *, exog=None, add_const=False, offsets=None, compute_offsets_method='zero')[source]

PLN mixture model, that is, a Gaussian mixture model with a Poisson layer on top of it. The effect of covariates is shared across clusters. Note that stability may decrease significantly as the number of covariates grows.
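The description above can be read as a three-step generative process. The following is a minimal standard-library sketch of it, not the library's implementation: it assumes an exponential link between the latent layer and the Poisson rate, ignores covariates and offsets, and uses hypothetical helper names (`sample_poisson`, `sample_pln_mixture`).

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson variate with Knuth's method (suitable for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_pln_mixture(n_samples, weights, cluster_means, cluster_variances, seed=0):
    """Sample (counts, labels) from a toy Poisson layer over a Gaussian mixture."""
    rng = random.Random(seed)
    counts, labels = [], []
    for _ in range(n_samples):
        # 1. Pick a cluster according to the mixture weights.
        k = rng.choices(range(len(weights)), weights=weights)[0]
        # 2. Draw a Gaussian latent vector with a diagonal covariance.
        latent = [
            rng.gauss(mean, math.sqrt(var))
            for mean, var in zip(cluster_means[k], cluster_variances[k])
        ]
        # 3. Poisson layer: each count has rate exp(latent value) (assumed link).
        counts.append([sample_poisson(math.exp(z), rng) for z in latent])
        labels.append(k)
    return counts, labels


counts, labels = sample_pln_mixture(
    n_samples=6,
    weights=[0.5, 0.5],
    cluster_means=[[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]],
    cluster_variances=[[0.1, 0.1, 0.1], [0.2, 0.2, 0.2]],
)
```

Fitting the model amounts to going the other way: recovering the cluster of each row of counts without observing `labels`.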

Examples

>>> from pyPLNmodels import PlnMixture, load_scrna, plot_confusion_matrix
>>> data = load_scrna()
>>> mixture = PlnMixture(data["endog"],n_cluster = 3)
>>> mixture.fit()
>>> print(mixture)
>>> plot_confusion_matrix(mixture.clusters, data["labels"])

Parameters:

endog (Tensor | ndarray | DataFrame)
n_cluster (int)
exog (Tensor | ndarray | DataFrame | None)
add_const (bool)
offsets (Tensor | ndarray | DataFrame | None)
compute_offsets_method ({'logsum', 'zero'})

__init__(endog, n_cluster, *, exog=None, add_const=False, offsets=None, compute_offsets_method='zero')[source]

Initializes the model class.

Parameters:

endog (Union[torch.Tensor, np.ndarray, pd.DataFrame]) – The count data.
exog (Union[torch.Tensor, np.ndarray, pd.DataFrame], optional(keyword-only)) – The covariate data. Defaults to None.
offsets (Union[torch.Tensor, np.ndarray, pd.DataFrame], optional(keyword-only)) – The offsets data. Defaults to None.
compute_offsets_method (str ("zero", "logsum"), optional(keyword-only)) –
Method to compute offsets if not provided. Options are:
- "zero", which sets the offsets to zero.
- "logsum", which takes the logarithm of the sum (per row) of the counts.
Ignored if offsets is not None. Defaults to "zero".
add_const (bool, optional(keyword-only)) – Whether to add a column of ones to the exog. Defaults to False.
n_cluster (int) – The number of clusters in the model.
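The two compute_offsets_method options can be illustrated with plain Python lists. This is only a sketch: the function names are hypothetical, and repeating the per-row value across every column (so that the offsets have the same shape as the counts) is an assumption, not something stated on this page.

```python
import math


def zero_offsets(endog):
    """'zero' option: every offset is 0."""
    return [[0.0] * len(row) for row in endog]


def logsum_offsets(endog):
    """'logsum' option: log of the row total, repeated across the row (assumed layout)."""
    offsets = []
    for row in endog:
        log_total = math.log(sum(row))  # assumes every row has a positive total
        offsets.append([log_total] * len(row))
    return offsets


toy_counts = [[1, 2, 7], [0, 5, 5], [3, 0, 0]]
offsets = logsum_offsets(toy_counts)
```

With "logsum", samples with larger total counts receive larger offsets, which is a common way to account for differences in sampling effort.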

Return type:

PlnMixture

See also

pyPLNmodels.PlnMixture.from_formula()

Examples

>>> from pyPLNmodels import PlnMixture, load_scrna
>>> data = load_scrna()
>>> mixture = PlnMixture(data["endog"],n_cluster = 3)
>>> mixture.fit()
>>> print(mixture)

Notes

The add_const keyword is ignored here. Adding an intercept to the covariates would make the coefficients of the mixture model non-identifiable.
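The identifiability issue can be seen with a small numeric sketch. It assumes that the latent mean of a cluster is the sum of a shared intercept and a cluster-specific mean; the helper `latent_means` and all values are hypothetical.

```python
def latent_means(intercept, cluster_bias):
    """Per-cluster latent mean: shared intercept plus cluster-specific mean."""
    return [
        [intercept_j + bias_j for intercept_j, bias_j in zip(intercept, bias)]
        for bias in cluster_bias
    ]


intercept = [1.0, -0.5]
cluster_bias = [[0.0, 2.0], [3.0, 1.0]]

# Move a constant from the cluster means into the intercept.
shift = 0.25
shifted_intercept = [value + shift for value in intercept]
shifted_bias = [[value - shift for value in bias] for bias in cluster_bias]

original = latent_means(intercept, cluster_bias)
shifted = latent_means(shifted_intercept, shifted_bias)
```

Both parameter sets give the same latent means, so the data cannot distinguish between them; this is why an intercept is not added on top of the cluster means.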

classmethod from_formula(formula, data, n_cluster, *, compute_offsets_method='zero')[source]

Create an instance from a formula and data.

Parameters:

formula (str) – The formula.
data (dict) – The data dictionary. Each value can be either a torch.Tensor, np.ndarray, pd.DataFrame or pd.Series. The categorical exogenous variables should be 1-dimensional.
compute_offsets_method (str, optional(keyword-only)) –
Method to compute offsets if not provided. Options are:
- "zero", which sets the offsets to zero.
- "logsum", which takes the logarithm of the sum (per row) of the counts.
Ignored if data["offsets"] is not None. Defaults to "zero".
n_cluster (int) – The number of clusters in the model.

Return type:

PlnMixture

See also

pyPLNmodels.PlnMixture()

Examples

>>> from pyPLNmodels import PlnMixture, load_scrna
>>> data = load_scrna()
>>> mixture = PlnMixture.from_formula("endog ~ 0",data, n_cluster = 3)
>>> mixture.fit()
>>> print(mixture)

fit(*, maxiter=400, lr=0.01, tol=1e-06, verbose=False)[source]

Fit the model using variational inference. The lower the tol (tolerance), the more accurate the model.

Parameters:

maxiter (int, optional(keyword-only)) – The maximum number of iterations to be done. Defaults to 400.
lr (float, optional(keyword-only)) – The learning rate. Defaults to 0.01.
tol (float, optional(keyword-only)) – The tolerance for convergence. Defaults to 1e-6.
verbose (bool, optional(keyword-only)) – Whether to print training progress. Defaults to False.
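How maxiter, lr and tol interact can be sketched on a toy problem. The loop below is a generic gradient ascent on a one-dimensional objective and is not the library's optimizer; in particular, the stopping rule (stop when the objective changes by less than tol) is an assumption, since this page does not describe the actual convergence criterion.

```python
def toy_fit(maxiter=400, lr=0.01, tol=1e-6):
    """Gradient ascent on f(x) = -(x - 3)^2 with a tolerance-based stop."""
    x = 0.0
    previous = -((x - 3.0) ** 2)
    for iteration in range(1, maxiter + 1):
        x += lr * (-2.0 * (x - 3.0))  # one gradient step of size lr
        current = -((x - 3.0) ** 2)
        if abs(current - previous) < tol:  # assumed convergence rule
            break
        previous = current
    return x, iteration


loose_x, loose_iters = toy_fit(tol=1e-2)
tight_x, tight_iters = toy_fit(tol=1e-6)
capped_x, capped_iters = toy_fit(maxiter=50)
```

A smaller tol runs longer and ends closer to the optimum, while maxiter caps the number of iterations regardless of tol.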

Raises:

ValueError – If maxiter is not an int.

Return type:

PlnMixture object

Examples

>>> from pyPLNmodels import PlnMixture, load_scrna
>>> data = load_scrna()
>>> mixture = PlnMixture.from_formula("endog ~ 0", data, n_cluster = 3)
>>> mixture.fit()
>>> print(mixture)

>>> from pyPLNmodels import PlnMixture, load_scrna
>>> data = load_scrna()
>>> mixture = PlnMixture.from_formula("endog ~ 0", data, n_cluster = 3)
>>> mixture.fit(maxiter=500, verbose=True)
>>> print(mixture)

property covariances: Tensor of covariances of shape (n_cluster, dim). Each vector corresponds to the (diagonal) covariance of each cluster.

property weights: Probability of a sample to belong to each cluster. Tensor of size (n_cluster).

viz(*, ax=None, colors=None, show_cov=False, remove_exog_effect=False)[source]

Visualize the latent variables. One can remove the effect of exogenous variables with the remove_exog_effect boolean variable.

Parameters:

ax (matplotlib.axes.Axes, optional) – The axes on which to plot, by default None.
colors (list, optional) – The labels to color the samples, of size n_samples. If None, will take the inferred clusters.
show_cov (bool, optional) – Whether to show covariances, by default False.
remove_exog_effect (bool, optional) – Whether to remove the effect of exogenous variables. Defaults to False.

Examples

>>> from pyPLNmodels import PlnMixture, load_scrna
>>> data = load_scrna()
>>> mixture = PlnMixture.from_formula("endog ~ 0", data=data, n_cluster = 3)
>>> mixture.fit()
>>> mixture.viz()
>>> mixture.viz(colors=data["labels"])
>>> mixture.viz(show_cov=True)
>>> mixture.viz(remove_exog_effect=True, colors=data["labels"])

property covariance: The GMM does not have a single covariance. It has multiple covariances, one per cluster. You may call .covariances.

biplot(column_names, *, column_index=None, colors=None, title='', remove_exog_effect=False)[source]

Visualizes variables using the correlation circle along with the PCA-transformed samples. If the endog was given as a pd.DataFrame, its column names have been stored and can be passed through the column_names argument. Otherwise, the indices of the variables should be provided.

Parameters:

column_names (List[str]) – A list of variable names to visualize. If column_index is None, the variables plotted are the ones in column_names. If column_index is not None, this only serves as a legend. Check the attribute column_names_endog.
column_index (Optional[List[int]], optional keyword-only) – A list of indices corresponding to the variables that should be plotted. If None, the indices are determined based on column_names_endog given the column_names, by default None. If not None, should have the same length as column_names.
title (str optional, keyword-only) – An additional title for the plot.
colors (list, optional, keyword-only) – The labels to color the samples, by default the inferred clusters.
remove_exog_effect (bool, optional) – Whether to remove the effect of exogenous variables. Defaults to False.

Raises:

ValueError – If column_index is not None and the length of column_index is different from the length of column_names.

Examples

>>> from pyPLNmodels import PlnMixture, load_scrna
>>> data = load_scrna()
>>> mixture = PlnMixture.from_formula("endog ~ 0", data=data, n_cluster = 3)
>>> mixture.fit()
>>> mixture.biplot(column_names=["MALAT1", "ACTB"])
>>> mixture.biplot(
...     column_names=["A", "B"],
...     column_index=[1, 3],
...     colors=data["labels"],
... )

property clusters: The predicted clusters of each sample.

compute_elbo()[source]: Compute the elbo of the current parameters.

property dict_latent_parameters: The latent parameters of the model.

property dict_model_parameters: The parameters of the model.

property latent_positions

The (conditional) mean of the latent variables with the effect of covariates removed.

Examples

>>> from pyPLNmodels import PlnMixture, load_scrna
>>> data = load_scrna()
>>> mixture = PlnMixture.from_formula("endog ~ 0", data, n_cluster = 3)
>>> mixture.fit()
>>> print("Shape latent positions: ", mixture.latent_positions.shape)
>>> mixture.viz(remove_exog_effect=True) # Visualize the latent positions
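The relation between latent_variables and latent_positions can be sketched as a subtraction of the covariate contribution. This is an illustration under the assumption that the covariate effect is the matrix product of exog and coef, with coef of shape (nb_cov, dim); the function name and the toy values are hypothetical.

```python
def remove_exog_effect(latent_variables, exog, coef):
    """Subtract the covariate contribution exog @ coef from each latent vector."""
    positions = []
    for latent_row, exog_row in zip(latent_variables, exog):
        effect = [
            sum(x * coef[c][j] for c, x in enumerate(exog_row))
            for j in range(len(latent_row))
        ]
        positions.append([z - e for z, e in zip(latent_row, effect)])
    return positions


latent_variables = [[2.0, 1.0], [0.5, 3.0]]  # shape (n_samples, dim)
exog = [[1.0], [2.0]]                        # shape (n_samples, nb_cov)
coef = [[0.5, -1.0]]                         # shape (nb_cov, dim)
latent_positions = remove_exog_effect(latent_variables, exog, coef)
```

Without covariates (as in the "endog ~ 0" examples), there is nothing to subtract and both quantities coincide under this reading.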

property latent_variables

The (conditional) mean of the latent variables. This is the best approximation of latent variables. This variable is supposed to be more meaningful than the counts (endog).

Examples

>>> from pyPLNmodels import PlnMixture, load_scrna
>>> data = load_scrna()
>>> mixture = PlnMixture.from_formula("endog ~ 0", data, n_cluster = 3)
>>> mixture.fit()
>>> print("Shape latent variables: ", mixture.latent_variables.shape)
>>> mixture.viz() # Visualize the latent variables

property latent_means: Tensor of latent_means of shape (n_cluster, n_samples, dim). Each vector corresponds to the latent_mean of each cluster.

property latent_sqrt_variances: Tensor of latent_sqrt_variances of shape (n_cluster, n_samples, dim). Each vector corresponds to the latent_sqrt_variance of each cluster.

property latent_prob: Latent probability that sample i corresponds to cluster k. Returns a torch.Tensor of size (n_samples, n_cluster).

property list_of_parameters_needing_gradient: The list of all the parameters of the model that need to be updated at each iteration.

property number_of_parameters: Number of parameters.

pca_pairplot(n_components=3, colors=None, remove_exog_effect=False)[source]

Generates a scatter matrix plot based on Principal Component Analysis (PCA) on the latent variables.

Parameters:

n_components (int, optional) – The number of components to consider for plotting. Defaults to 3. It cannot be greater than 6.
colors (np.ndarray, optional) – An array with one label for each sample in the endog property of the object. Defaults to the inferred clusters.
remove_exog_effect (bool, optional) – Whether to remove or not the effect of exogenous variables. Defaults to False.

Raises:

ValueError – If the number of components requested is greater than the number of variables in the dataset.

Examples

>>> from pyPLNmodels import PlnMixture, load_scrna
>>> data = load_scrna()
>>> mixture = PlnMixture.from_formula("endog ~ 0", data=data, n_cluster=3)
>>> mixture.fit()
>>> mixture.pca_pairplot(n_components=5)
>>> mixture.pca_pairplot(n_components=5, colors=data["labels"])

plot_correlation_circle(column_names, column_index=None, title='')[source]

Visualizes variables using PCA and plots a correlation circle. If the endog has been given as a pd.DataFrame, the column_names have been stored and may be indicated with the column_names argument. Else, one should provide the indices of variables.

Parameters:

column_names (List[str]) – A list of variable names to visualize. If column_index is None, the variables plotted are the ones in column_names. If column_index is not None, this only serves as a legend. Check the attribute column_names_endog.
column_index (Optional[List[int]], optional) – A list of indices corresponding to the variables that should be plotted. If None, the indices are determined based on column_names_endog given the column_names, by default None. If not None, should have the same length as column_names.
title (str) – An additional title for the plot.

Raises:

ValueError – If column_index is None and column_names_endog is not set. The latter is set only when the model has been initialized with a pd.DataFrame as endog.
ValueError – If the length of column_index is different from the length of column_names.

Examples

>>> from pyPLNmodels import PlnMixture, load_scrna
>>> data = load_scrna()
>>> mixture = PlnMixture.from_formula("endog ~ 0", data=data, n_cluster = 3)
>>> mixture.fit()
>>> mixture.plot_correlation_circle(column_names=["MALAT1", "ACTB"])
>>> mixture.plot_correlation_circle(column_names=["A", "B"], column_index=[0, 4])

property n_cluster: Number of clusters in the model.

property cluster_bias: The mean that is associated to each cluster, of size (n_cluster, dim). This does not encompass the mean that does not depend on each cluster. Each vector cluster_bias[k] is the bias associated to cluster k.

predict_clusters(endog, *, exog=None, offsets=None)[source]

Predict the clusters of the given endog and exog. The dimensions of endog, exog, and offsets should match the ones given in the model.

Parameters:

endog (Union[torch.Tensor, np.ndarray, pd.DataFrame]) – The count data.
exog (Union[torch.Tensor, np.ndarray, pd.DataFrame], optional(keyword-only)) – The covariate data. Defaults to None.
offsets (Union[torch.Tensor, np.ndarray, pd.DataFrame], optional(keyword-only)) – The offsets data. Defaults to None.

Raises:

ValueError – If the endog (or exog) has wrong shape compared to the previously fitted endog (or exog) variables.

Returns:

The predicted clusters

Return type:

list

Examples

>>> from pyPLNmodels import PlnMixture, load_scrna
>>> data = load_scrna()
>>> mixture = PlnMixture(data["endog"], n_cluster = 3).fit()
>>> pred = mixture.predict_clusters(data["endog"])
>>> print('pred', pred)

sigma()[source]: Covariance of the model.

property entropy: Entropy of the latent variables.

property WCSS

Compute the Within-Cluster Sum of Squares on the latent positions.

The lower the better, but increasing n_cluster tends to decrease the metric mechanically. A trade-off (with the elbow method, for example) must be found.

Returns positive float.
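For reference, the within-cluster sum of squares is the sum of squared distances between each point and the centroid of its cluster. The sketch below computes it on plain lists with a hypothetical `wcss` helper; it follows the standard definition and is not taken from the library's source.

```python
def wcss(points, clusters):
    """Sum of squared distances from each point to its cluster centroid."""
    groups = {}
    for point, label in zip(points, clusters):
        groups.setdefault(label, []).append(point)
    total = 0.0
    for members in groups.values():
        centroid = [sum(column) / len(members) for column in zip(*members)]
        total += sum(
            (value - center) ** 2
            for point in members
            for value, center in zip(point, centroid)
        )
    return total


toy_points = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 12.0]]
one_cluster = wcss(toy_points, [0, 0, 0, 0])
two_clusters = wcss(toy_points, [0, 0, 1, 1])
```

On this toy data, splitting the points into two clusters gives a much smaller value than a single cluster, which is why the metric has to be balanced against the number of clusters.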

property silhouette

Compute the silhouette score on the latent_positions. See sklearn.metrics.silhouette_score for more information.

The higher the better.

Returns float between -1 and 1.
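The silhouette score compares, for each sample, the mean distance to the other members of its own cluster (a) with the smallest mean distance to the members of another cluster (b), and averages (b - a) / max(a, b) over samples. The following standard-library sketch follows that textbook definition with Euclidean distances; the `silhouette` helper is hypothetical, assumes the points are not all identical, and returns 0 where scikit-learn would raise an error (a single cluster).

```python
import math


def silhouette(points, clusters):
    """Mean silhouette coefficient with Euclidean distances."""
    scores = []
    for i, (point, label) in enumerate(zip(points, clusters)):
        distances = {}
        for j, (other, other_label) in enumerate(zip(points, clusters)):
            if i != j:
                distances.setdefault(other_label, []).append(math.dist(point, other))
        own = distances.pop(label, [])
        if not own or not distances:
            scores.append(0.0)  # singleton cluster, or only one cluster present
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in distances.values())
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


toy_points = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 12.0]]
good_score = silhouette(toy_points, [0, 0, 1, 1])
bad_score = silhouette(toy_points, [0, 1, 0, 1])
```

A labelling that matches the two visible groups scores close to 1, while a labelling that mixes them scores much lower.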

property AIC: Akaike Information Criterion (AIC).

property BIC: Bayesian Information Criterion (BIC) of the model.

property ICL: Integrated Completed Likelihood criterion.

property coef

Property representing the regression coefficients of size (nb_cov, dim). If no exogenous (exog) is available, returns None.

Returns: The coefficients or None if no coefficients are given in the model.
Return type: torch.Tensor or None

property dim: Number of dimensions (i.e. variables) of the dataset.

property elbo: Returns the last elbo computed.

property endog

Property representing the endogenous variables (counts).

Returns: The endogenous variables.
Return type: torch.Tensor

property exog

Property representing the exogenous variables (covariates).

Returns: The exogenous variables or None if no covariates are given in the model.
Return type: torch.Tensor or None

property latent_mean

Property representing the latent mean conditionally on the observed counts, i.e. the conditional mean of the latent variable of each sample.

Returns: The latent mean.
Return type: torch.Tensor

property latent_parameters: Alias for dict_latent_parameters.

property latent_sqrt_variance

Property representing the latent square root variance conditionally on the observed counts, i.e. the square root variance of the latent variable of each sample.

Returns: The square root of the latent variance.
Return type: torch.Tensor

property latent_variance: Property representing the latent variance conditionally on the observed counts, i.e. the conditional variance of the latent variable of each sample.

property loglike: Alias for elbo.

property marginal_mean: The marginal mean of the model, i.e. the mean of the gaussian latent variable.

property model_parameters: Alias for dict_model_parameters.

property n_samples: Number of samples in the dataset.

property nb_cov: int: The number of exogenous variables.

property offsets

Property representing the offsets.

Returns: The offsets.
Return type: torch.Tensor

property optim_details

Property representing the optimization details.

Returns: The dictionary of optimization details.
Return type: dict

plot_expected_vs_true(ax=None, colors=None)

Plot the predicted value of the endog against the endog.

Parameters:

ax (Optional[matplotlib.axes.Axes], optional) – The matplotlib axis to use. If None, the current axis is used, by default None.
colors (Optional[Any], optional) – The labels to color the samples, of size n_samples. By default None (no colors).

Returns:

The matplotlib axis.

Return type:

matplotlib.axes.Axes

property precision

Property representing the precision of the model, that is the inverse covariance matrix.

Returns: The precision matrix of size (dim, dim).
Return type: torch.Tensor

predict(array_like=None)

projected_latent_variables(rank=2, remove_exog_effect=False)

Perform PCA on latent variables and return the projected variables.

Parameters:

rank (int, optional) – The number of principal components to compute, by default 2.
remove_exog_effect (bool, optional) – Whether to remove the effect of exogenous variables. Defaults to False.

Returns:

The projected variables.

Return type:

numpy.ndarray

remove_zero_columns = True

show(savefig=False, name_file='', figsize=(10, 10))

Display the model parameters, norm evolution of the parameters and the criterion.

Parameters:

savefig (bool, optional) – If True, the figure will be saved to a file. Default is False.
name_file (str, optional) – The name of the file to save the figure. Only used if savefig is True. Default is an empty string.
figsize (tuple of two positive floats, optional) – Size of the figure that will be created. Defaults to (10, 10).

transform(remove_exog_effect=False)

Returns the latent variables. Can be seen as a normalization of the given counts.

Parameters: remove_exog_effect (bool, optional) – Whether to remove the mean induced by the exogenous variables. Defaults to False.
Returns: The transformed endogenous variables (latent variables of the model).
Return type: torch.Tensor

optim: torch.optim.Optimizer

List of methods and attributes

Public Data Attributes:

`covariances`	Tensor of covariances of shape (n_cluster, dim).
`weights`	Probability of a sample to belong to each cluster.
`covariance`	The GMM does not have a single covariance.
`clusters`	The predicted clusters of each sample.
`dict_latent_parameters`	The latent parameters of the model.
`dict_model_parameters`	The parameters of the model.
`latent_positions`	The (conditional) mean of the latent variables with the effect of covariates removed.
`latent_variables`	The (conditional) mean of the latent variables.
`latent_means`	Tensor of latent_means of shape (n_cluster, n_samples, dim).
`latent_sqrt_variances`	Tensor of latent_sqrt_variances of shape (n_cluster, n_samples, dim).
`latent_prob`	Latent probability that sample i corresponds to cluster k.
`list_of_parameters_needing_gradient`	The list of all the parameters of the model that need to be updated at each iteration.
`number_of_parameters`	Number of parameters.
`n_cluster`	Number of clusters in the model.
`cluster_bias`	The mean that is associated to each cluster, of size (n_cluster, dim).
`entropy`	Entropy of the latent variables.
`WCSS`	Compute the Within-Cluster Sum of Squares on the latent positions.
`silhouette`	Compute the silhouette score on the latent_positions.
`optim`

Inherited from BaseModel

`remove_zero_columns`
`list_of_parameters_needing_gradient`	The list of all the parameters of the model that need to be updated at each iteration.
`dict_model_parameters`	The parameters of the model.
`model_parameters`	Alias for dict_model_parameters.
`dict_latent_parameters`	The latent parameters of the model.
`latent_parameters`	Alias for dict_latent_parameters.
`n_samples`	Number of samples in the dataset.
`dim`	Number of dimensions (i.e. variables) of the dataset.
`endog`	Property representing the endogenous variables (counts).
`exog`	Property representing the exogenous variables (covariates).
`nb_cov`	The number of exogenous variables.
`offsets`	Property representing the offsets.
`latent_mean`	Property representing the latent mean conditionally on the observed counts, i.e. the conditional mean of the latent variable of each sample.
`latent_variance`	Property representing the latent variance conditionally on the observed counts, i.e. the conditional variance of the latent variable of each sample.
`latent_sqrt_variance`	Property representing the latent square root variance conditionally on the observed counts, i.e. the square root variance of the latent variable of each sample.
`coef`	Property representing the regression coefficients of size (nb_cov, dim).
`covariance`	Property representing the covariance of the model.
`precision`	Property representing the precision of the model, that is the inverse covariance matrix.
`marginal_mean`	The marginal mean of the model, i.e. the mean of the gaussian latent variable.
`latent_variables`	The (conditional) mean of the latent variables.
`latent_positions`	The (conditional) mean of the latent variables with the effect of covariates removed.
`elbo`	Returns the last elbo computed.
`loglike`	Alias for elbo.
`BIC`	Bayesian Information Criterion (BIC) of the model.
`ICL`	Integrated Completed Likelihood criterion.
`AIC`	Akaike Information Criterion (AIC).
`number_of_parameters`	Returns the number of parameters of the model.
`entropy`	Entropy of the latent variables.
`optim_details`	Property representing the optimization details.
`optim`

Public Methods:

`__init__`(endog, n_cluster, *[, exog, ...])	Initializes the model class.
`from_formula`(formula, data, n_cluster, *[, ...])	Create an instance from a formula and data.
`fit`(*[, maxiter, lr, tol, verbose])	Fit the model using variational inference.
`viz`(*[, ax, colors, show_cov, ...])	Visualize the latent variables.
`biplot`(column_names, *[, column_index, ...])	Visualizes variables using the correlation circle along with the pca transformed samples.
`compute_elbo`()	Compute the elbo of the current parameters.
`pca_pairplot`([n_components, colors, ...])	Generates a scatter matrix plot based on Principal Component Analysis (PCA) on the latent variables.
`plot_correlation_circle`(column_names[, ...])	Visualizes variables using PCA and plots a correlation circle.
`predict_clusters`(endog, *[, exog, offsets])	Predict the clusters of the given endog and exog.
`sigma`()	Covariance of the model.

Inherited from BaseModel

`__init__`(endog, *[, exog, offsets, ...])	Initializes the model class.
`from_formula`(formula, data, *[, ...])	Create an instance from a formula and data.
`fit`(*[, maxiter, lr, tol, verbose])	Fit the model using variational inference.
`show`([savefig, name_file, figsize])	Display the model parameters, norm evolution of the parameters and the criterion.
`plot_correlation_circle`(column_names[, ...])	Visualizes variables using PCA and plots a correlation circle.
`biplot`(column_names, *[, column_index, ...])	Visualizes variables using the correlation circle along with the pca transformed samples.
`compute_elbo`()	Compute the elbo of the current parameters.
`projected_latent_variables`([rank, ...])	Perform PCA on latent variables and return the projected variables.
`transform`([remove_exog_effect])	Returns the latent variables.
`viz`(*[, ax, colors, show_cov, ...])	Visualize the latent variables.
`__repr__`()	Generate the string representation of the model.
`predict`([array_like])
`sigma`()	Covariance of the model.
`pca_pairplot`([n_components, colors, ...])	Generates a scatter matrix plot based on Principal Component Analysis (PCA) on the latent variables.
`plot_expected_vs_true`([ax, colors])	Plot the predicted value of the endog against the endog.

Private Data Attributes:

`_covariance`
`_covariances`
`_description`	Description of the model.
`_dict_for_printing`	Property representing the dictionary for printing.
`_endog_predictions`	Abstract method to predict the endog variables.
`_clusters`
`_marginal_means`
`_additional_attributes_list`	The attributes that are specific to this model.
`_additional_methods_list`	The methods that are specific to this model.
`_abc_impl`
`_weights`
`_cluster_bias`
`_latent_prob`
`_latent_means`
`_latent_sqrt_variances`
`_sqrt_covariances`
`_per_sample_per_cluster_elbo`
`_time_recorder`
`_dict_list_mse`
`_latent_mean`
`_latent_sqrt_variance`
`_coef`

Inherited from BaseModel

`_name`
`_description`	Description of the model.
`_default_dict_model_parameters`
`_default_dict_latent_parameters`
`_precision`
`_marginal_mean`
`_useful_methods_list`
`_useful_attributes_list`
`_additional_attributes_list`	The attributes that are specific to this model.
`_additional_methods_list`	The methods that are specific to this model.
`_dict_for_printing`	Property representing the dictionary for printing.
`_endog_predictions`	Abstract method to predict the endog variables.
`_latent_dim`
`_abc_impl`
`_time_recorder`
`_dict_list_mse`
`_latent_mean`
`_latent_sqrt_variance`
`_coef`
`_covariance`

Inherited from ABC

_abc_impl

Private Methods:

`_init_parameters`()
`_init_latent_parameters`()	Everything is done in the _init_parameters method.
`_init_model_parameters`()	Everything is done in the _init_parameters method.
`_update_closed_forms`()	Update some parameters.
`_get_two_dim_latent_variances`(sklearn_components)	Computes the covariance when the latent variables are embedded in a lower dimensional space (often 2) with sklearn_components.

Inherited from BaseModel

`_get_model_viz`()
`_trainstep`()	Compute the elbo and do a gradient step.
`_compute_loss`(elbo)
`_initialize_timing`()
`_print_beginning_message`()
`_print_end_of_fitting_message`(...)
`_init_parameters`()
`_print_start_init`()
`_print_end_init`()
`_init_model_parameters`()	Initialization of model parameters.
`_init_latent_parameters`()	Initialization of latent parameters.
`_set_requiring_grad_true`()	Move parameters to the GPU device if present.
`_handle_optimizer`(lr)
`_fitting_initialization`(lr, maxiter)
`_project_parameters`()	Project some parameters such as probabilities.
`_update_closed_forms`()	Update some parameters.
`_track_mse`()
`_print_stats`(iterdone, maxiter, tol)	Print the training statistics.
`_pca_projected_latent_variables_with_covariances`([...])	Perform PCA on latent variables and return the projected variables along with their covariances in the two dimensional space.
`_get_two_dim_latent_variances`(sklearn_components)	Computes the covariance when the latent variables are embedded in a lower dimensional space (often 2) with sklearn_components.