mdsa_tools.Analysis

High-level wrapper for storing and running common analyses over systems of residue–residue adjacency matrices (e.g., H-bond networks).

You can follow the provided pipeline end-to-end, or call individual, modular steps (flattening → feature matrix → dimensionality reduction → clustering). For example, you might cluster arbitrary n-dimensional feature matrices, or pull H-bond values via systems_analysis.extract_hbond_values() and use those in replicate maps instead of k-means cluster assignments.

See Also

mdsa_tools.Viz.visualize_reduction : Plot PCA/UMAP embeddings. mdsa_tools.Data_gen_hbond.create_system_representations : Build residue–residue H-bond adjacency matrices. numpy.linalg.svd : Linear algebra used under the hood.

Classes

systems_analysis([systems_representations, ...])

Container for systems-level analyses on residue–residue adjacency matrices.

class mdsa_tools.Analysis.systems_analysis(systems_representations=None, replicate_distribution=None, precomputed_feature_matrix=None, res_indexes_for_precomputedmatrix=None)

Bases: object

Container for systems-level analyses on residue–residue adjacency matrices.

This class assumes each “system representation” is a 3D array with shape (n_frames, n_residues, n_residues) where a single frame contains a residue×residue adjacency matrix (e.g., H-bond counts/weights). Many methods operate on a feature matrix constructed by flattening/stacking per-frame upper triangles.

Parameters:

systems_representationslist of np.ndarray: [array_1, ..., array_m], each of shape (n_frames, n_residues, n_residues). The only permitted difference across arrays is n_frames; all systems should share the same residue dimension. Index metadata are expected in row/column 0.
replicate_distributionarray-like of int: Optional labels or indices describing how frames are distributed across replicates/systems. If None, defaults to np.arange(0, systems_representations[0].shape[0]).

Attributes:

num_systemsint: Number of systems provided (len(systems_representations)).
systems_representationslist of np.ndarray: Original input, retained for convenience.
indexesnp.ndarray: Derived from systems_representations[0][0, 0, 1:]. These are used to label residue–residue comparisons when generating feature names.
feature_matrixnp.ndarray: Cached 2D feature matrix produced by replicates_to_feature_matrix.
replicate_distributionnp.ndarray: Distribution vector set from the provided argument or default.

Methods

`assignment_schemas`([centers, labels, ...])
`autocorrelation_function`([...])
`cluster_embeddingspace`([...])	(Deprecated) Cluster each system independently in a reduced embedding space.
`compute_cluster_assignment_schemas`([...])
`create_PCA_ranked_weights`([outfile_path, ...])	Create a ranked table of PCA feature weights for the first two principal components.
`extract_hbond_values`(residues[, ...])	Aggregate per-frame H-bond values over a chosen residue set.
`pearson_sums_only`([...])	Compute Pearson r between ALL within-space pairwise Euclidean distances of feature space (X) and embedding space (Y) without materializing the O(n^2) distance vectors.
`perform_clust_opt`(outfile_path[, ...])	Sweep K for KMeans and pick “best” K by silhouette and elbow, or fit a single K.
`perform_kmeans`([outfile_path, max_clusters, ...])	Run KMeans clustering on a feature matrix, either: (a) sweeping K to select optima by silhouette and elbow criteria, or (b) fitting once at a fixed K.
`perform_optimized_UMAP_global`([...])	Grid-search UMAP over n_neighbors and min_dist, and for each embedding compute Pearson r between pdist(feature space) and pdist(embedding space).
`perform_optimized_UMAP_local`([...])	Grid-search UMAP over n_neighbors and min_dist, and for each embedding compute Pearson r between pdist(feature space) and pdist(embedding space).
`reduce_systems_representations`([...])	Reduce the dimensionality of a per-frame feature matrix using PCA or UMAP.
`replicates_to_feature_matrix`([arrays])	Construct a flattened, per-frame feature matrix from one or more systems of residue×residue adjacency matrices.
`run_PCA`(feature_matrix, n)	Run Principal Components Analysis (PCA) and return transformed coordinates, component loadings, and explained variance ratios.

t_test_autocorrelation

Notes

Many operations expect systems of the same residue size.
Designed for comparative analyses but also works with a single system.
The class focuses on simple, modular wrappers around common steps (flattening, reduction, clustering) for downstream use.

assignment_schemas(centers=None, labels=None, feature_matrix=None, indexes=None)

Parameters:

centersnp.ndarray, shape== (n_clusters, n_features): Cluster centers (e.g., from KMeans). If None (or if labels is None), centers and labels are obtained from perform_clust_opt().
labelsarray-like of int, shape=(n_samples,): Cluster assignment for each sample/frame.
feature_matrixnp.ndarray, shape=(n_samples, n_features): Per-sample feature matrix. If None, uses self.feature_matrix. Features are assumed to correspond to the upper-triangle residue–residue pairs (excluding diagonal), as produced by replicates_to_feature_matrix.
indexesarray-like of int: Residue indices used to label features. If None, uses self.indexes. Labels are generated as strings "i-j" in the same order as the upper-triangle feature construction.

Returns:

pandas.DataFrame

Tidy table with one row per (centroid, pair) containing:

pair — residue–residue label "i-j" (str)
count — number of samples in that centroid for which this pair

was the maximum-contributing feature (int) * centroid — centroid/cluster id (int)

The table can be sorted/grouped to retrieve the top pairs per centroid.

Notes

Per-feature contributions are computed as (x_j - c_j)**2 for each

feature j and assigned sample x relative to its centroid c. * The “dominant pair” for a sample is the feature index with the maximum squared deviation (ties take the first occurrence although for our implementation use case there was not much of an issue).

Examples

>>> labels, centers = sa.perform_kmeans(k=2)
>>> df = sa.compute_cluster_assignment_schemas(centers, labels)
>>> df.sort_values(["centroid","count"], ascending=[True, False])         ...   .groupby("centroid").head(20)  

autocorrelation_function(feature_space_coordinates=None, lags=None)

Parameters:
Returns:

cluster_embeddingspace(reduced_coordinates=None, outfile_path=None, num_systems=None, val_metric=None)

(Deprecated) Cluster each system independently in a reduced embedding space.

…

compute_cluster_assignment_schemas(centers=None, labels=None, feature_matrix=None, indexes=None)

Parameters:

centersnp.ndarray, shape== (n_clusters, n_features): Cluster centers (e.g., from KMeans). If None (or if labels is None), centers and labels are obtained from perform_clust_opt().
labelsarray-like of int, shape=(n_samples,): Cluster assignment for each sample/frame.
feature_matrixnp.ndarray, shape=(n_samples, n_features): Per-sample feature matrix. If None, uses self.feature_matrix. Features are assumed to correspond to the upper-triangle residue–residue pairs (excluding diagonal), as produced by replicates_to_feature_matrix.
indexesarray-like of int: Residue indices used to label features. If None, uses self.indexes. Labels are generated as strings "i-j" in the same order as the upper-triangle feature construction.

Returns:

pandas.DataFrame

Tidy table with one row per (centroid, pair) containing:

pair — residue–residue label "i-j" (str)
count — number of samples in that centroid for which this pair

was the maximum-contributing feature (int) * centroid — centroid/cluster id (int)

The table can be sorted/grouped to retrieve the top pairs per centroid.

Notes

Per-feature contributions are computed as (x_j - c_j)**2 for each

feature j and assigned sample x relative to its centroid c. * The “dominant pair” for a sample is the feature index with the maximum squared deviation (ties take the first occurrence although for our implementation use case there was not much of an issue).

Examples

>>> df.sort_values(["centroid","count"], ascending=[True, False])         ...   .groupby("centroid").head(20)  

create_PCA_ranked_weights(outfile_path=None, weights=None, indexes=None)

Create a ranked table of PCA feature weights for the first two principal components.

Parameters:

outfile_pathstr or pathlib.Path: Directory where outputs may be written. If None, uses the current working directory.
weightsnp.ndarray, shape== (n_components, n_features): PCA component loadings (rows = components, columns = features). If None, this function calls reduce_systems_representations() to compute PCA (default n=2) and uses the returned weights.
indexesarray-like of int: Residue indices used to label pairwise comparisons. If None, uses self.indexes. These indices define the order used to generate upper-triangle residue–residue comparison labels (e.g., “12-47”).

Returns:

pandas.DataFrame

A table mapping each feature (upper-triangle residue pair) to its PCA weights and magnitudes with columns:

Comparisons — ‘i-j’ residue pair label (str)
PC1_Weights — raw loading for PC1 (float)
PC2_Weights — raw loading for PC2 (float)
PC1_magnitude — (PC1_Weights)**2 (float)
PC2_magnitude — (PC2_Weights)**2 (float)
PC1_mag_norm — min–max normalized PC1_magnitude to [0, 1] (float) [optional if postprocessed]
PC2_mag_norm — min–max normalized PC2_magnitude to [0, 1] (float) [optional if postprocessed]

Notes

Only the upper triangle (excluding the diagonal) of the residue–residue matrix is used, so each row corresponds to a unique residue pair. The function assumes at least two principal components are available.

Examples

>>> df = sa.create_PCA_ranked_weights()

extract_hbond_values(residues, systems_array=None, mode='sum')

Aggregate per-frame H-bond values over a chosen residue set.

Parameters:

residueslist of int: Residue indices to retain (e.g., a surface or motif). All pairwise combinations among these residues are used.
systems_arraynp.ndarray, shape=``(n_frames, n_res, n_res)``: Averaged array or a single 3D trajectory array. If None, all systems in self.systems_representations are concatenated.
mode{‘sum’,’average’}, default=’sum’: Aggregation across the upper triangle for each frame.

Returns:

np.ndarray: Shape=``(n_frames,)`` containing the aggregated value per frame.

Notes

Index/label row/column 0 are dropped before aggregation.
Only the upper triangle is aggregated to avoid double counting.

Examples

>>> vals = sa.extract_hbond_values([12, 47, 91], mode='average')

pearson_sums_only(feature_space_coordinates=None, embedding_space_coordinates=None)

Compute Pearson r between ALL within-space pairwise Euclidean distances of feature space (X) and embedding space (Y) without materializing the O(n^2) distance vectors. We stream the five totals needed for Pearson.

Parameters:

feature_space_coordinatesnp.ndarray, shape=(n_samples, p): High-D per-sample feature rows. If None, we build self.feature_matrix.
embedding_space_coordinatesnp.ndarray, shape=(n_samples, q): Low-D per-sample embedding rows. If None, we build a PCA embedding.

Returns:

float: Pearson correlation coefficient r in [-1, 1] between pdist(X) and pdist(Y).

Notes

We iterate over the upper triangle: for each row i, compare to all rows j>i.
Distances are computed via ‖a-b‖² = ‖a‖² + ‖b‖² − 2·(a·b), so we only need

squared row norms and a BLAS-backed dot product (Xj @ xi). * We never keep the giant distance vectors, only the 5 running sums.

Examples

>>> r = obj.pearson_sums_only(X, Y)  # exact Pearson of pdist(X) vs pdist(Y)

perform_clust_opt(outfile_path, max_clusters=None, data=None, k=None)

Sweep K for KMeans and pick “best” K by silhouette and elbow, or fit a single K.

Parameters:

outfile_pathstr or pathlib.Path: If provided, per-K label arrays are saved via np.save as "<outfile_path>kluster_labels_{K}clust.npy" and selection plots are delegated to mdsa_tools.Viz. If None, nothing is written and elbow selection falls back to a simple heuristic.
max_clustersint: Upper bound (inclusive) for the K sweep (default 10). The sweep runs from K=2 through K=max_clusters.
datanp.ndarray, shape=(n_samples, n_features): Feature matrix to cluster. If None, uses self.feature_matrix.
kint: If set, skip the sweep and fit one KMeans at exactly k clusters.

Returns:

If k is None:

optimal_k_silhouette_labelsnp.ndarray of int, shape=(n_samples,): Labels for the silhouette-selected K.
optimal_k_elbow_labelsnp.ndarray of int, shape=(n_samples,): Labels for the elbow-selected K.
centers_sillohuettenp.ndarray, shape=(K_sil, n_features): Cluster centers for the silhouette-selected K. (Legacy spelling kept.)
centers_elbownp.ndarray, shape=(K_elb, n_features): Cluster centers for the elbow-selected K.

If k is not None:

cluster_labelsnp.ndarray of int, shape=(n_samples,): Labels from the single KMeans run.
cluster_centersnp.ndarray, shape=(k, n_features): Cluster centers from the single KMeans run.

Notes

Silhouette uses sklearn.metrics.silhouette_score on the provided feature space (no precomputed distance matrix).
Elbow choice:
- If outfile_path is provided, the decision is delegated to mdsa_tools.Viz.plot_elbow_scores (with plotting).
- If not, and too few points exist to fit a curve, falls back to the argmin of inertia over the available Ks.
KMeans(init='random', n_init=K, random_state=0) is used for reproducibility.
The variable/name sillohuette is spelled as-is to preserve legacy usage.

Examples

>>> labels_sil, labels_elb, C_sil, C_elb = sa.perform_clust_opt(None, max_clusters=8, data=X)
>>> labels, centers = sa.perform_clust_opt(None, data=X, k=3)

perform_kmeans(outfile_path=None, max_clusters=None, data=None, k=None)

Run KMeans clustering on a feature matrix, either: (a) sweeping K to select optima by silhouette and elbow criteria, or (b) fitting once at a fixed K.

Parameters:

outfile_pathstr or pathlib.Path: Directory where per-K label arrays are saved (via np.save) when sweeping K (i.e., when k is None). One file is written per value of K. If None, nothing is written to disk. Default is None.
max_clustersint: Upper bound on the number of clusters to consider when sweeping K. Effective only when k is None. The exact K range is determined by perform_clust_opt() (typically 2..max_clusters inclusive). Default is 10.
dataarray-like of shape=(n_samples, n_features): Feature matrix to cluster. If None, uses self.feature_matrix.
kint: If provided, fit a single KMeans model with exactly k clusters and return its labels and centers. If None, perform a sweep over K and return the best solutions under the silhouette and elbow criteria.

Returns:

cluster_labelsndarray of shape=(n_samples,), dtype=int: (Only when k is not None) Cluster assignment for each sample, with labels in [0, k-1].
cluster_centersndarray of shape=(k, n_features): (Only when k is not None) Centroids of the fitted model.
optimal_k_silhouette_labelsndarray of shape=(n_samples,), dtype=int: (Only when k is None) Labels for the K chosen by the silhouette criterion.
optimal_k_elbow_labelsndarray of shape=(n_samples,), dtype=int: (Only when k is None) Labels for the K chosen by the elbow (inertia) criterion.
centers_sillohuettendarray of shape=(k_silhouette, n_features): (Only when k is None) Centers corresponding to optimal_k_silhouette_labels.
centers_elbowndarray of shape=(k_elbow, n_features): (Only when k is None) Centers corresponding to optimal_k_elbow_labels.

Notes

When k is provided, this method fits sklearn.cluster.KMeans with init='random', n_init=k, and random_state=0 for reproducibility.
When k is None, selection of the optimal K and any per-K saving behavior are delegated to perform_clust_opt(), which is expected to return the label arrays and centers for the silhouette- and elbow-selected solutions.

Examples

Fit a fixed-K model:

>>> labels, centers = obj.perform_kmeans(k=5)

Sweep K (saving per-K labels):

>>> sil_labels, elbow_labels, sil_centers, elbow_centers =         ...     obj.perform_kmeans(outfile_path="results/kmeans", max_clusters=12)

perform_optimized_UMAP_global(feature_matrix=None, max_neighbors=None, min_neighbors=None, stepsize=None, min_dist_values=None)

Grid-search UMAP over n_neighbors and min_dist, and for each embedding compute Pearson r between pdist(feature space) and pdist(embedding space). Returns a tidy DataFrame for easy plotting (bubble chart etc.).

Parameters:

feature_matrixnp.ndarray, shape=(n_samples, n_features): If None, use self.feature_matrix (and build it if needed).
max_neighborsint: Exclusive upper bound for the neighbors sweep. Default 100.
min_neighborsint: Inclusive lower bound for the neighbors sweep. Default 10.
stepsizeint: Step between neighbor counts. Default 10 (i.e., 10, 20, 30, …).
min_dist_valuesiterable of float: Set of UMAP min_dist values to try. Default (0.1, 0.4, 0.7, 1.0).

Returns:

pandas.DataFrame: Columns: - n_neighbors (int) - min_dist (float) - pearson_r (float, rounded to 2 decimals) - r01_for_bubbles (float in [0,100]): visual helper = 100*((r+1)/2)

Notes

UMAP embeddings depend on (n_neighbors, min_dist); we loop both.
Pearson is computed with pearson_sums_only (streaming, exact).
r01_for_bubbles maps r∈[-1,1] -> [0,100] for bubble size/color.

perform_optimized_UMAP_local(feature_matrix=None, max_neighbors=None, min_neighbors=None, stepsize=None, min_dist_values=None, eval_n=None)

Grid-search UMAP over n_neighbors and min_dist, and for each embedding compute Pearson r between pdist(feature space) and pdist(embedding space). Returns a tidy DataFrame for easy plotting (bubble chart etc.).

Parameters:

feature_matrixnp.ndarray, shape=(n_samples, n_features): If None, use self.feature_matrix (and build it if needed).
max_neighborsint: Exclusive upper bound for the neighbors sweep. Default 100.
min_neighborsint: Inclusive lower bound for the neighbors sweep. Default 10.
stepsizeint: Step between neighbor counts. Default 10 (i.e., 10, 20, 30, …).
min_dist_valuesiterable of float: Set of UMAP min_dist values to try. Default (0.1, 0.4, 0.7, 1.0).

Returns:

pandas.DataFrame: Columns: - n_neighbors (int) - min_dist (float) - pearson_r (float, rounded to 2 decimals) - r01_for_bubbles (float in [0,100]): visual helper = 100*((r+1)/2)

Notes

UMAP embeddings depend on (n_neighbors, min_dist); we loop both.
Pearson is computed with pearson_sums_only (streaming, exact).
r01_for_bubbles maps r∈[-1,1] -> [0,100] for bubble size/color.

reduce_systems_representations(feature_matrix=None, method=None, n_components=None, min_dist=None, n_neighbors=None)

Reduce the dimensionality of a per-frame feature matrix using PCA or UMAP.

Parameters:

feature_matrixarray-like of shape=(n_samples, n_features): Matrix to reduce. If omitted, uses self.feature_matrix; if that is not set, calls replicates_to_feature_matrix() to build it.
method{‘PCA’, ‘UMAP’}, default ‘PCA’: Dimensionality-reduction algorithm.
n_componentsint, default 2: Target embedding dimensionality.
min_distfloat: UMAP min_dist hyperparameter controlling how tightly points cluster (smaller ⇒ tighter clusters). Ignored for PCA. Default 0.5.
n_neighborsint: UMAP neighborhood size controlling local vs global structure. Ignored for PCA. Default 900.

Returns:

If method == ‘PCA’:

(X_pca, weights, explained_variance_ratio_)tuple

X_pcandarray of shape=(n_samples, n_components): PCA scores.
weightsndarray of shape=(n_components, n_features): Component loadings.
explained_variance_ratio_ndarray of shape=(n_components,): Fraction of variance explained by each principal component.

If method == ‘UMAP’:

embeddingndarray of shape=(n_samples, n_components): Low-dimensional embedding.

Notes

PCA uses sklearn.decomposition.PCA.
UMAP uses umap.UMAP with the provided hyperparameters.

Examples

>>> X, W, evr = sa.reduce_systems_representations(method='PCA', n_components=2)
>>> U = sa.reduce_systems_representations(method='UMAP', n_components=2)

replicates_to_feature_matrix(arrays=None) → ndarray

Construct a flattened, per-frame feature matrix from one or more systems of residue×residue adjacency matrices.

Parameters:

arrayslist of np.ndarray: [array_1, ..., array_m] where each array has shape (n_frames, n_residues, n_residues). If None, uses self.systems_representations. The only axis allowed to differ across arrays is n_frames.

Returns:

np.ndarray: Shape=``(sum(n_frames), n_features)``, where n_features equals the number of unique residue–residue pairs from the upper triangle (excluding the diagonal) of the per-frame matrix. Each row corresponds to a single frame across all systems.

Notes

Index row/column 0 are dropped ([1:, 1:]) under the assumption they hold metadata (residue indices).
Only the upper triangle is used to avoid duplicate symmetric entries.
The result is cached to self.feature_matrix for reuse.

Examples

>>> arrays = [sys1, sys2]  # each (n_frames, n_res, n_res)
>>> X = sa.replicates_to_feature_matrix(arrays)

run_PCA(feature_matrix, n)

Run Principal Components Analysis (PCA) and return transformed coordinates, component loadings, and explained variance ratios.

Parameters:

feature_matrixnp.ndarray: Shape=``(n_samples, n_features)``. Typically produced by replicates_to_feature_matrix().
nint: Number of components to retain (default 2 in upstream callers).

Returns:

X_pcanp.ndarray, shape=(n_samples, n): Transformed coordinates in the principal-component space.
weightsnp.ndarray, shape=(n, n_features): PCA component loadings (rows = components, columns = original features).
explained_variancesnp.ndarray, shape=(n,): Fraction of variance explained by each selected component.

Notes

Thin wrapper around sklearn.decomposition.PCA; fits on the provided feature matrix and returns the transformed coordinates, component loadings, and explained-variance ratios.

Examples

>>> X_pca, W, evr = sa.run_PCA(feature_matrix, 2)
>>> X_pca.shape=
(feature_matrix.shape[0], 2)

t_test_autocorrelation(results=None, m=None, alpha=None)