mdsa_tools.Analysis

High-level wrapper for storing and running common analyses over systems of residue–residue adjacency matrices (e.g., H-bond networks).

You can follow the provided pipeline end-to-end, or call individual, modular steps (flattening → feature matrix → dimensionality reduction → clustering). For example, you might cluster arbitrary n-dimensional feature matrices, or pull H-bond values via systems_analysis.extract_hbond_values() and use those in replicate maps instead of k-means cluster assignments.

See Also

mdsa_tools.Viz.visualize_reduction : Plot PCA/UMAP embeddings. mdsa_tools.Data_gen_hbond.create_system_representations : Build residue–residue H-bond adjacency matrices. numpy.linalg.svd : Linear algebra used under the hood.

Classes

systems_analysis([systems_representations, ...])

Container for systems-level analyses on residue–residue adjacency matrices.

class mdsa_tools.Analysis.systems_analysis(systems_representations=None, replicate_distribution=None, precomputed_feature_matrix=None, res_indexes_for_precomputedmatrix=None)

Bases: object

Container for systems-level analyses on residue–residue adjacency matrices.

This class assumes each “system representation” is a 3D array with shape (n_frames, n_residues, n_residues) where a single frame contains a residue×residue adjacency matrix (e.g., H-bond counts/weights). Many methods operate on a feature matrix constructed by flattening/stacking per-frame upper triangles.

Parameters:
systems_representationslist of np.ndarray

[array_1, ..., array_m], each of shape (n_frames, n_residues, n_residues). The only permitted difference across arrays is n_frames; all systems should share the same residue dimension. Index metadata are expected in row/column 0.

replicate_distributionarray-like of int

Optional labels or indices describing how frames are distributed across replicates/systems. If None, defaults to np.arange(0, systems_representations[0].shape[0]).

Attributes:
num_systemsint

Number of systems provided (len(systems_representations)).

systems_representationslist of np.ndarray

Original input, retained for convenience.

indexesnp.ndarray

Derived from systems_representations[0][0, 0, 1:]. These are used to label residue–residue comparisons when generating feature names.

feature_matrixnp.ndarray

Cached 2D feature matrix produced by replicates_to_feature_matrix.

replicate_distributionnp.ndarray

Distribution vector set from the provided argument or default.

Methods

assignment_schemas([centers, labels, ...])

autocorrelation_function([...])

cluster_embeddingspace([...])

(Deprecated) Cluster each system independently in a reduced embedding space.

compute_cluster_assignment_schemas([...])

create_PCA_ranked_weights([outfile_path, ...])

Create a ranked table of PCA feature weights for the first two principal components.

extract_hbond_values(residues[, ...])

Aggregate per-frame H-bond values over a chosen residue set.

pearson_sums_only([...])

Compute Pearson r between ALL within-space pairwise Euclidean distances of feature space (X) and embedding space (Y) without materializing the O(n^2) distance vectors.

perform_clust_opt(outfile_path[, ...])

Sweep K for KMeans and pick “best” K by silhouette and elbow, or fit a single K.

perform_kmeans([outfile_path, max_clusters, ...])

Run KMeans clustering on a feature matrix, either: (a) sweeping K to select optima by silhouette and elbow criteria, or (b) fitting once at a fixed K.

perform_optimized_UMAP_global([...])

Grid-search UMAP over n_neighbors and min_dist, and for each embedding compute Pearson r between pdist(feature space) and pdist(embedding space).

perform_optimized_UMAP_local([...])

Grid-search UMAP over n_neighbors and min_dist, and for each embedding compute Pearson r between pdist(feature space) and pdist(embedding space).

reduce_systems_representations([...])

Reduce the dimensionality of a per-frame feature matrix using PCA or UMAP.

replicates_to_feature_matrix([arrays])

Construct a flattened, per-frame feature matrix from one or more systems of residue×residue adjacency matrices.

run_PCA(feature_matrix, n)

Run Principal Components Analysis (PCA) and return transformed coordinates, component loadings, and explained variance ratios.

t_test_autocorrelation

Notes

  • Many operations expect systems of the same residue size.

  • Designed for comparative analyses but also works with a single system.

  • The class focuses on simple, modular wrappers around common steps (flattening, reduction, clustering) for downstream use.

assignment_schemas(centers=None, labels=None, feature_matrix=None, indexes=None)
Parameters:
centersnp.ndarray, shape== (n_clusters, n_features)

Cluster centers (e.g., from KMeans). If None (or if labels is None), centers and labels are obtained from perform_clust_opt().

labelsarray-like of int, shape=(n_samples,)

Cluster assignment for each sample/frame.

feature_matrixnp.ndarray, shape=(n_samples, n_features)

Per-sample feature matrix. If None, uses self.feature_matrix. Features are assumed to correspond to the upper-triangle residue–residue pairs (excluding diagonal), as produced by replicates_to_feature_matrix.

indexesarray-like of int

Residue indices used to label features. If None, uses self.indexes. Labels are generated as strings "i-j" in the same order as the upper-triangle feature construction.

Returns:
pandas.DataFrame

Tidy table with one row per (centroid, pair) containing:

  • pair — residue–residue label "i-j" (str)

  • count — number of samples in that centroid for which this pair

was the maximum-contributing feature (int) * centroid — centroid/cluster id (int)

The table can be sorted/grouped to retrieve the top pairs per centroid.

Notes

  • Per-feature contributions are computed as (x_j - c_j)**2 for each

feature j and assigned sample x relative to its centroid c. * The “dominant pair” for a sample is the feature index with the maximum squared deviation (ties take the first occurrence although for our implementation use case there was not much of an issue).

Examples

>>> labels, centers = sa.perform_kmeans(k=2)
>>> df = sa.compute_cluster_assignment_schemas(centers, labels)
>>> df.sort_values(["centroid","count"], ascending=[True, False])         ...   .groupby("centroid").head(20)  
autocorrelation_function(feature_space_coordinates=None, lags=None)
Parameters:
Returns:
cluster_embeddingspace(reduced_coordinates=None, outfile_path=None, num_systems=None, val_metric=None)

(Deprecated) Cluster each system independently in a reduced embedding space.

compute_cluster_assignment_schemas(centers=None, labels=None, feature_matrix=None, indexes=None)
Parameters:
centersnp.ndarray, shape== (n_clusters, n_features)

Cluster centers (e.g., from KMeans). If None (or if labels is None), centers and labels are obtained from perform_clust_opt().

labelsarray-like of int, shape=(n_samples,)

Cluster assignment for each sample/frame.

feature_matrixnp.ndarray, shape=(n_samples, n_features)

Per-sample feature matrix. If None, uses self.feature_matrix. Features are assumed to correspond to the upper-triangle residue–residue pairs (excluding diagonal), as produced by replicates_to_feature_matrix.

indexesarray-like of int

Residue indices used to label features. If None, uses self.indexes. Labels are generated as strings "i-j" in the same order as the upper-triangle feature construction.

Returns:
pandas.DataFrame

Tidy table with one row per (centroid, pair) containing:

  • pair — residue–residue label "i-j" (str)

  • count — number of samples in that centroid for which this pair

was the maximum-contributing feature (int) * centroid — centroid/cluster id (int)

The table can be sorted/grouped to retrieve the top pairs per centroid.

Notes

  • Per-feature contributions are computed as (x_j - c_j)**2 for each

feature j and assigned sample x relative to its centroid c. * The “dominant pair” for a sample is the feature index with the maximum squared deviation (ties take the first occurrence although for our implementation use case there was not much of an issue).

Examples

>>> df.sort_values(["centroid","count"], ascending=[True, False])         ...   .groupby("centroid").head(20)  
create_PCA_ranked_weights(outfile_path=None, weights=None, indexes=None)

Create a ranked table of PCA feature weights for the first two principal components.

Parameters:
outfile_pathstr or pathlib.Path

Directory where outputs may be written. If None, uses the current working directory.

weightsnp.ndarray, shape== (n_components, n_features)

PCA component loadings (rows = components, columns = features). If None, this function calls reduce_systems_representations() to compute PCA (default n=2) and uses the returned weights.

indexesarray-like of int

Residue indices used to label pairwise comparisons. If None, uses self.indexes. These indices define the order used to generate upper-triangle residue–residue comparison labels (e.g., “12-47”).

Returns:
pandas.DataFrame

A table mapping each feature (upper-triangle residue pair) to its PCA weights and magnitudes with columns:

  • Comparisons — ‘i-j’ residue pair label (str)

  • PC1_Weights — raw loading for PC1 (float)

  • PC2_Weights — raw loading for PC2 (float)

  • PC1_magnitude — (PC1_Weights)**2 (float)

  • PC2_magnitude — (PC2_Weights)**2 (float)

  • PC1_mag_norm — min–max normalized PC1_magnitude to [0, 1] (float) [optional if postprocessed]

  • PC2_mag_norm — min–max normalized PC2_magnitude to [0, 1] (float) [optional if postprocessed]

Notes

Only the upper triangle (excluding the diagonal) of the residue–residue matrix is used, so each row corresponds to a unique residue pair. The function assumes at least two principal components are available.

Examples

>>> df = sa.create_PCA_ranked_weights()
extract_hbond_values(residues, systems_array=None, mode='sum')

Aggregate per-frame H-bond values over a chosen residue set.

Parameters:
residueslist of int

Residue indices to retain (e.g., a surface or motif). All pairwise combinations among these residues are used.

systems_arraynp.ndarray, shape=``(n_frames, n_res, n_res)``

Averaged array or a single 3D trajectory array. If None, all systems in self.systems_representations are concatenated.

mode{‘sum’,’average’}, default=’sum’

Aggregation across the upper triangle for each frame.

Returns:
np.ndarray

Shape=``(n_frames,)`` containing the aggregated value per frame.

Notes

  • Index/label row/column 0 are dropped before aggregation.

  • Only the upper triangle is aggregated to avoid double counting.

Examples

>>> vals = sa.extract_hbond_values([12, 47, 91], mode='average')
pearson_sums_only(feature_space_coordinates=None, embedding_space_coordinates=None)

Compute Pearson r between ALL within-space pairwise Euclidean distances of feature space (X) and embedding space (Y) without materializing the O(n^2) distance vectors. We stream the five totals needed for Pearson.

Parameters:
feature_space_coordinatesnp.ndarray, shape=(n_samples, p)

High-D per-sample feature rows. If None, we build self.feature_matrix.

embedding_space_coordinatesnp.ndarray, shape=(n_samples, q)

Low-D per-sample embedding rows. If None, we build a PCA embedding.

Returns:
float

Pearson correlation coefficient r in [-1, 1] between pdist(X) and pdist(Y).

Notes

  • We iterate over the upper triangle: for each row i, compare to all rows j>i.

  • Distances are computed via ‖a-b‖² = ‖a‖² + ‖b‖² − 2·(a·b), so we only need

squared row norms and a BLAS-backed dot product (Xj @ xi). * We never keep the giant distance vectors, only the 5 running sums.

Examples

>>> r = obj.pearson_sums_only(X, Y)  # exact Pearson of pdist(X) vs pdist(Y)
perform_clust_opt(outfile_path, max_clusters=None, data=None, k=None)

Sweep K for KMeans and pick “best” K by silhouette and elbow, or fit a single K.

Parameters:
outfile_pathstr or pathlib.Path

If provided, per-K label arrays are saved via np.save as "<outfile_path>kluster_labels_{K}clust.npy" and selection plots are delegated to mdsa_tools.Viz. If None, nothing is written and elbow selection falls back to a simple heuristic.

max_clustersint

Upper bound (inclusive) for the K sweep (default 10). The sweep runs from K=2 through K=max_clusters.

datanp.ndarray, shape=(n_samples, n_features)

Feature matrix to cluster. If None, uses self.feature_matrix.

kint

If set, skip the sweep and fit one KMeans at exactly k clusters.

Returns:
If k is None:
optimal_k_silhouette_labelsnp.ndarray of int, shape=(n_samples,)

Labels for the silhouette-selected K.

optimal_k_elbow_labelsnp.ndarray of int, shape=(n_samples,)

Labels for the elbow-selected K.

centers_sillohuettenp.ndarray, shape=(K_sil, n_features)

Cluster centers for the silhouette-selected K. (Legacy spelling kept.)

centers_elbownp.ndarray, shape=(K_elb, n_features)

Cluster centers for the elbow-selected K.

If k is not None:
cluster_labelsnp.ndarray of int, shape=(n_samples,)

Labels from the single KMeans run.

cluster_centersnp.ndarray, shape=(k, n_features)

Cluster centers from the single KMeans run.

Notes

  • Silhouette uses sklearn.metrics.silhouette_score on the provided feature space (no precomputed distance matrix).

  • Elbow choice:
    • If outfile_path is provided, the decision is delegated to mdsa_tools.Viz.plot_elbow_scores (with plotting).

    • If not, and too few points exist to fit a curve, falls back to the argmin of inertia over the available Ks.

  • KMeans(init='random', n_init=K, random_state=0) is used for reproducibility.

  • The variable/name sillohuette is spelled as-is to preserve legacy usage.

Examples

>>> labels_sil, labels_elb, C_sil, C_elb = sa.perform_clust_opt(None, max_clusters=8, data=X)
>>> labels, centers = sa.perform_clust_opt(None, data=X, k=3)
perform_kmeans(outfile_path=None, max_clusters=None, data=None, k=None)

Run KMeans clustering on a feature matrix, either: (a) sweeping K to select optima by silhouette and elbow criteria, or (b) fitting once at a fixed K.

Parameters:
outfile_pathstr or pathlib.Path

Directory where per-K label arrays are saved (via np.save) when sweeping K (i.e., when k is None). One file is written per value of K. If None, nothing is written to disk. Default is None.

max_clustersint

Upper bound on the number of clusters to consider when sweeping K. Effective only when k is None. The exact K range is determined by perform_clust_opt() (typically 2..max_clusters inclusive). Default is 10.

dataarray-like of shape=(n_samples, n_features)

Feature matrix to cluster. If None, uses self.feature_matrix.

kint

If provided, fit a single KMeans model with exactly k clusters and return its labels and centers. If None, perform a sweep over K and return the best solutions under the silhouette and elbow criteria.

Returns:
cluster_labelsndarray of shape=(n_samples,), dtype=int

(Only when k is not None) Cluster assignment for each sample, with labels in [0, k-1].

cluster_centersndarray of shape=(k, n_features)

(Only when k is not None) Centroids of the fitted model.

optimal_k_silhouette_labelsndarray of shape=(n_samples,), dtype=int

(Only when k is None) Labels for the K chosen by the silhouette criterion.

optimal_k_elbow_labelsndarray of shape=(n_samples,), dtype=int

(Only when k is None) Labels for the K chosen by the elbow (inertia) criterion.

centers_sillohuettendarray of shape=(k_silhouette, n_features)

(Only when k is None) Centers corresponding to optimal_k_silhouette_labels.

centers_elbowndarray of shape=(k_elbow, n_features)

(Only when k is None) Centers corresponding to optimal_k_elbow_labels.

Notes

  • When k is provided, this method fits sklearn.cluster.KMeans with init='random', n_init=k, and random_state=0 for reproducibility.

  • When k is None, selection of the optimal K and any per-K saving behavior are delegated to perform_clust_opt(), which is expected to return the label arrays and centers for the silhouette- and elbow-selected solutions.

Examples

Fit a fixed-K model:

>>> labels, centers = obj.perform_kmeans(k=5)

Sweep K (saving per-K labels):

>>> sil_labels, elbow_labels, sil_centers, elbow_centers =         ...     obj.perform_kmeans(outfile_path="results/kmeans", max_clusters=12)
perform_optimized_UMAP_global(feature_matrix=None, max_neighbors=None, min_neighbors=None, stepsize=None, min_dist_values=None)

Grid-search UMAP over n_neighbors and min_dist, and for each embedding compute Pearson r between pdist(feature space) and pdist(embedding space). Returns a tidy DataFrame for easy plotting (bubble chart etc.).

Parameters:
feature_matrixnp.ndarray, shape=(n_samples, n_features)

If None, use self.feature_matrix (and build it if needed).

max_neighborsint

Exclusive upper bound for the neighbors sweep. Default 100.

min_neighborsint

Inclusive lower bound for the neighbors sweep. Default 10.

stepsizeint

Step between neighbor counts. Default 10 (i.e., 10, 20, 30, …).

min_dist_valuesiterable of float

Set of UMAP min_dist values to try. Default (0.1, 0.4, 0.7, 1.0).

Returns:
pandas.DataFrame

Columns: - n_neighbors (int) - min_dist (float) - pearson_r (float, rounded to 2 decimals) - r01_for_bubbles (float in [0,100]): visual helper = 100*((r+1)/2)

Notes

  • UMAP embeddings depend on (n_neighbors, min_dist); we loop both.

  • Pearson is computed with pearson_sums_only (streaming, exact).

  • r01_for_bubbles maps r∈[-1,1] -> [0,100] for bubble size/color.

perform_optimized_UMAP_local(feature_matrix=None, max_neighbors=None, min_neighbors=None, stepsize=None, min_dist_values=None, eval_n=None)

Grid-search UMAP over n_neighbors and min_dist, and for each embedding compute Pearson r between pdist(feature space) and pdist(embedding space). Returns a tidy DataFrame for easy plotting (bubble chart etc.).

Parameters:
feature_matrixnp.ndarray, shape=(n_samples, n_features)

If None, use self.feature_matrix (and build it if needed).

max_neighborsint

Exclusive upper bound for the neighbors sweep. Default 100.

min_neighborsint

Inclusive lower bound for the neighbors sweep. Default 10.

stepsizeint

Step between neighbor counts. Default 10 (i.e., 10, 20, 30, …).

min_dist_valuesiterable of float

Set of UMAP min_dist values to try. Default (0.1, 0.4, 0.7, 1.0).

Returns:
pandas.DataFrame

Columns: - n_neighbors (int) - min_dist (float) - pearson_r (float, rounded to 2 decimals) - r01_for_bubbles (float in [0,100]): visual helper = 100*((r+1)/2)

Notes

  • UMAP embeddings depend on (n_neighbors, min_dist); we loop both.

  • Pearson is computed with pearson_sums_only (streaming, exact).

  • r01_for_bubbles maps r∈[-1,1] -> [0,100] for bubble size/color.

reduce_systems_representations(feature_matrix=None, method=None, n_components=None, min_dist=None, n_neighbors=None)

Reduce the dimensionality of a per-frame feature matrix using PCA or UMAP.

Parameters:
feature_matrixarray-like of shape=(n_samples, n_features)

Matrix to reduce. If omitted, uses self.feature_matrix; if that is not set, calls replicates_to_feature_matrix() to build it.

method{‘PCA’, ‘UMAP’}, default ‘PCA’

Dimensionality-reduction algorithm.

n_componentsint, default 2

Target embedding dimensionality.

min_distfloat

UMAP min_dist hyperparameter controlling how tightly points cluster (smaller ⇒ tighter clusters). Ignored for PCA. Default 0.5.

n_neighborsint

UMAP neighborhood size controlling local vs global structure. Ignored for PCA. Default 900.

Returns:
If method == ‘PCA’:
(X_pca, weights, explained_variance_ratio_)tuple
X_pcandarray of shape=(n_samples, n_components)

PCA scores.

weightsndarray of shape=(n_components, n_features)

Component loadings.

explained_variance_ratio_ndarray of shape=(n_components,)

Fraction of variance explained by each principal component.

If method == ‘UMAP’:
embeddingndarray of shape=(n_samples, n_components)

Low-dimensional embedding.

Notes

  • PCA uses sklearn.decomposition.PCA.

  • UMAP uses umap.UMAP with the provided hyperparameters.

Examples

>>> X, W, evr = sa.reduce_systems_representations(method='PCA', n_components=2)
>>> U = sa.reduce_systems_representations(method='UMAP', n_components=2)
replicates_to_feature_matrix(arrays=None) ndarray

Construct a flattened, per-frame feature matrix from one or more systems of residue×residue adjacency matrices.

Parameters:
arrayslist of np.ndarray

[array_1, ..., array_m] where each array has shape (n_frames, n_residues, n_residues). If None, uses self.systems_representations. The only axis allowed to differ across arrays is n_frames.

Returns:
np.ndarray

Shape=``(sum(n_frames), n_features)``, where n_features equals the number of unique residue–residue pairs from the upper triangle (excluding the diagonal) of the per-frame matrix. Each row corresponds to a single frame across all systems.

Notes

  • Index row/column 0 are dropped ([1:, 1:]) under the assumption they hold metadata (residue indices).

  • Only the upper triangle is used to avoid duplicate symmetric entries.

  • The result is cached to self.feature_matrix for reuse.

Examples

>>> arrays = [sys1, sys2]  # each (n_frames, n_res, n_res)
>>> X = sa.replicates_to_feature_matrix(arrays)
run_PCA(feature_matrix, n)

Run Principal Components Analysis (PCA) and return transformed coordinates, component loadings, and explained variance ratios.

Parameters:
feature_matrixnp.ndarray

Shape=``(n_samples, n_features)``. Typically produced by replicates_to_feature_matrix().

nint

Number of components to retain (default 2 in upstream callers).

Returns:
X_pcanp.ndarray, shape=(n_samples, n)

Transformed coordinates in the principal-component space.

weightsnp.ndarray, shape=(n, n_features)

PCA component loadings (rows = components, columns = original features).

explained_variancesnp.ndarray, shape=(n,)

Fraction of variance explained by each selected component.

Notes

Thin wrapper around sklearn.decomposition.PCA; fits on the provided feature matrix and returns the transformed coordinates, component loadings, and explained-variance ratios.

Examples

>>> X_pca, W, evr = sa.run_PCA(feature_matrix, 2)
>>> X_pca.shape=
(feature_matrix.shape[0], 2)
t_test_autocorrelation(results=None, m=None, alpha=None)