mdsa_tools.Analysis
High-level wrapper for storing and running common analyses over systems of residue–residue adjacency matrices (e.g., H-bond networks).
You can follow the provided pipeline end-to-end, or call individual, modular
steps (flattening → feature matrix → dimensionality reduction → clustering).
For example, you might cluster arbitrary n-dimensional feature matrices, or
pull H-bond values via systems_analysis.extract_hbond_values() and use
those in replicate maps instead of k-means cluster assignments.
See Also
mdsa_tools.Viz.visualize_reduction : Plot PCA/UMAP embeddings. mdsa_tools.Data_gen_hbond.create_system_representations : Build residue–residue H-bond adjacency matrices. numpy.linalg.svd : Linear algebra used under the hood.
Classes
|
Container for systems-level analyses on residue–residue adjacency matrices. |
- class mdsa_tools.Analysis.systems_analysis(systems_representations=None, replicate_distribution=None, precomputed_feature_matrix=None, res_indexes_for_precomputedmatrix=None)
Bases:
objectContainer for systems-level analyses on residue–residue adjacency matrices.
This class assumes each “system representation” is a 3D array with shape
(n_frames, n_residues, n_residues)where a single frame contains a residue×residue adjacency matrix (e.g., H-bond counts/weights). Many methods operate on a feature matrix constructed by flattening/stacking per-frame upper triangles.- Parameters:
- systems_representationslist of np.ndarray
[array_1, ..., array_m], each of shape(n_frames, n_residues, n_residues). The only permitted difference across arrays isn_frames; all systems should share the same residue dimension. Index metadata are expected in row/column 0.- replicate_distributionarray-like of int
Optional labels or indices describing how frames are distributed across replicates/systems. If
None, defaults tonp.arange(0, systems_representations[0].shape[0]).
- Attributes:
- num_systemsint
Number of systems provided (
len(systems_representations)).- systems_representationslist of np.ndarray
Original input, retained for convenience.
- indexesnp.ndarray
Derived from
systems_representations[0][0, 0, 1:]. These are used to label residue–residue comparisons when generating feature names.- feature_matrixnp.ndarray
Cached 2D feature matrix produced by
replicates_to_feature_matrix.- replicate_distributionnp.ndarray
Distribution vector set from the provided argument or default.
Methods
assignment_schemas([centers, labels, ...])autocorrelation_function([...])cluster_embeddingspace([...])(Deprecated) Cluster each system independently in a reduced embedding space.
create_PCA_ranked_weights([outfile_path, ...])Create a ranked table of PCA feature weights for the first two principal components.
extract_hbond_values(residues[, ...])Aggregate per-frame H-bond values over a chosen residue set.
pearson_sums_only([...])Compute Pearson r between ALL within-space pairwise Euclidean distances of feature space (X) and embedding space (Y) without materializing the O(n^2) distance vectors.
perform_clust_opt(outfile_path[, ...])Sweep K for KMeans and pick “best” K by silhouette and elbow, or fit a single K.
perform_kmeans([outfile_path, max_clusters, ...])Run KMeans clustering on a feature matrix, either: (a) sweeping K to select optima by silhouette and elbow criteria, or (b) fitting once at a fixed K.
Grid-search UMAP over n_neighbors and min_dist, and for each embedding compute Pearson r between pdist(feature space) and pdist(embedding space).
perform_optimized_UMAP_local([...])Grid-search UMAP over n_neighbors and min_dist, and for each embedding compute Pearson r between pdist(feature space) and pdist(embedding space).
Reduce the dimensionality of a per-frame feature matrix using PCA or UMAP.
replicates_to_feature_matrix([arrays])Construct a flattened, per-frame feature matrix from one or more systems of residue×residue adjacency matrices.
run_PCA(feature_matrix, n)Run Principal Components Analysis (PCA) and return transformed coordinates, component loadings, and explained variance ratios.
t_test_autocorrelation
Notes
Many operations expect systems of the same residue size.
Designed for comparative analyses but also works with a single system.
The class focuses on simple, modular wrappers around common steps (flattening, reduction, clustering) for downstream use.
- assignment_schemas(centers=None, labels=None, feature_matrix=None, indexes=None)
- Parameters:
- centersnp.ndarray, shape== (n_clusters, n_features)
Cluster centers (e.g., from KMeans). If
None(or iflabelsisNone), centers and labels are obtained fromperform_clust_opt().- labelsarray-like of int, shape=(n_samples,)
Cluster assignment for each sample/frame.
- feature_matrixnp.ndarray, shape=(n_samples, n_features)
Per-sample feature matrix. If
None, usesself.feature_matrix. Features are assumed to correspond to the upper-triangle residue–residue pairs (excluding diagonal), as produced byreplicates_to_feature_matrix.- indexesarray-like of int
Residue indices used to label features. If
None, usesself.indexes. Labels are generated as strings"i-j"in the same order as the upper-triangle feature construction.
- Returns:
- pandas.DataFrame
Tidy table with one row per (centroid, pair) containing:
pair— residue–residue label"i-j"(str)count— number of samples in that centroid for which this pair
was the maximum-contributing feature (int) *
centroid— centroid/cluster id (int)The table can be sorted/grouped to retrieve the top pairs per centroid.
Notes
Per-feature contributions are computed as
(x_j - c_j)**2for each
feature
jand assigned samplexrelative to its centroidc. * The “dominant pair” for a sample is the feature index with the maximum squared deviation (ties take the first occurrence although for our implementation use case there was not much of an issue).Examples
>>> labels, centers = sa.perform_kmeans(k=2) >>> df = sa.compute_cluster_assignment_schemas(centers, labels) >>> df.sort_values(["centroid","count"], ascending=[True, False]) ... .groupby("centroid").head(20)
- autocorrelation_function(feature_space_coordinates=None, lags=None)
- Parameters:
- Returns:
- cluster_embeddingspace(reduced_coordinates=None, outfile_path=None, num_systems=None, val_metric=None)
(Deprecated) Cluster each system independently in a reduced embedding space.
…
- compute_cluster_assignment_schemas(centers=None, labels=None, feature_matrix=None, indexes=None)
- Parameters:
- centersnp.ndarray, shape== (n_clusters, n_features)
Cluster centers (e.g., from KMeans). If
None(or iflabelsisNone), centers and labels are obtained fromperform_clust_opt().- labelsarray-like of int, shape=(n_samples,)
Cluster assignment for each sample/frame.
- feature_matrixnp.ndarray, shape=(n_samples, n_features)
Per-sample feature matrix. If
None, usesself.feature_matrix. Features are assumed to correspond to the upper-triangle residue–residue pairs (excluding diagonal), as produced byreplicates_to_feature_matrix.- indexesarray-like of int
Residue indices used to label features. If
None, usesself.indexes. Labels are generated as strings"i-j"in the same order as the upper-triangle feature construction.
- Returns:
- pandas.DataFrame
Tidy table with one row per (centroid, pair) containing:
pair— residue–residue label"i-j"(str)count— number of samples in that centroid for which this pair
was the maximum-contributing feature (int) *
centroid— centroid/cluster id (int)The table can be sorted/grouped to retrieve the top pairs per centroid.
Notes
Per-feature contributions are computed as
(x_j - c_j)**2for each
feature
jand assigned samplexrelative to its centroidc. * The “dominant pair” for a sample is the feature index with the maximum squared deviation (ties take the first occurrence although for our implementation use case there was not much of an issue).Examples
>>> df.sort_values(["centroid","count"], ascending=[True, False]) ... .groupby("centroid").head(20)
- create_PCA_ranked_weights(outfile_path=None, weights=None, indexes=None)
Create a ranked table of PCA feature weights for the first two principal components.
- Parameters:
- outfile_pathstr or pathlib.Path
Directory where outputs may be written. If
None, uses the current working directory.- weightsnp.ndarray, shape== (n_components, n_features)
PCA component loadings (rows = components, columns = features). If
None, this function callsreduce_systems_representations()to compute PCA (default n=2) and uses the returnedweights.- indexesarray-like of int
Residue indices used to label pairwise comparisons. If
None, usesself.indexes. These indices define the order used to generate upper-triangle residue–residue comparison labels (e.g., “12-47”).
- Returns:
- pandas.DataFrame
A table mapping each feature (upper-triangle residue pair) to its PCA weights and magnitudes with columns:
Comparisons— ‘i-j’ residue pair label (str)PC1_Weights— raw loading for PC1 (float)PC2_Weights— raw loading for PC2 (float)PC1_magnitude— (PC1_Weights)**2 (float)PC2_magnitude— (PC2_Weights)**2 (float)PC1_mag_norm— min–max normalized PC1_magnitude to [0, 1] (float) [optional if postprocessed]PC2_mag_norm— min–max normalized PC2_magnitude to [0, 1] (float) [optional if postprocessed]
Notes
Only the upper triangle (excluding the diagonal) of the residue–residue matrix is used, so each row corresponds to a unique residue pair. The function assumes at least two principal components are available.
Examples
>>> df = sa.create_PCA_ranked_weights()
- extract_hbond_values(residues, systems_array=None, mode='sum')
Aggregate per-frame H-bond values over a chosen residue set.
- Parameters:
- residueslist of int
Residue indices to retain (e.g., a surface or motif). All pairwise combinations among these residues are used.
- systems_arraynp.ndarray, shape=``(n_frames, n_res, n_res)``
Averaged array or a single 3D trajectory array. If
None, all systems inself.systems_representationsare concatenated.- mode{‘sum’,’average’}, default=’sum’
Aggregation across the upper triangle for each frame.
- Returns:
- np.ndarray
Shape=``(n_frames,)`` containing the aggregated value per frame.
Notes
Index/label row/column
0are dropped before aggregation.Only the upper triangle is aggregated to avoid double counting.
Examples
>>> vals = sa.extract_hbond_values([12, 47, 91], mode='average')
- pearson_sums_only(feature_space_coordinates=None, embedding_space_coordinates=None)
Compute Pearson r between ALL within-space pairwise Euclidean distances of feature space (X) and embedding space (Y) without materializing the O(n^2) distance vectors. We stream the five totals needed for Pearson.
- Parameters:
- feature_space_coordinatesnp.ndarray, shape=(n_samples, p)
High-D per-sample feature rows. If None, we build self.feature_matrix.
- embedding_space_coordinatesnp.ndarray, shape=(n_samples, q)
Low-D per-sample embedding rows. If None, we build a PCA embedding.
- Returns:
- float
Pearson correlation coefficient r in [-1, 1] between pdist(X) and pdist(Y).
Notes
We iterate over the upper triangle: for each row i, compare to all rows j>i.
Distances are computed via ‖a-b‖² = ‖a‖² + ‖b‖² − 2·(a·b), so we only need
squared row norms and a BLAS-backed dot product (Xj @ xi). * We never keep the giant distance vectors, only the 5 running sums.
Examples
>>> r = obj.pearson_sums_only(X, Y) # exact Pearson of pdist(X) vs pdist(Y)
- perform_clust_opt(outfile_path, max_clusters=None, data=None, k=None)
Sweep K for KMeans and pick “best” K by silhouette and elbow, or fit a single K.
- Parameters:
- outfile_pathstr or pathlib.Path
If provided, per-K label arrays are saved via
np.saveas"<outfile_path>kluster_labels_{K}clust.npy"and selection plots are delegated tomdsa_tools.Viz. IfNone, nothing is written and elbow selection falls back to a simple heuristic.- max_clustersint
Upper bound (inclusive) for the K sweep (default
10). The sweep runs fromK=2throughK=max_clusters.- datanp.ndarray, shape=(n_samples, n_features)
Feature matrix to cluster. If
None, usesself.feature_matrix.- kint
If set, skip the sweep and fit one KMeans at exactly
kclusters.
- Returns:
- If
kis None: - optimal_k_silhouette_labelsnp.ndarray of int, shape=(n_samples,)
Labels for the silhouette-selected K.
- optimal_k_elbow_labelsnp.ndarray of int, shape=(n_samples,)
Labels for the elbow-selected K.
- centers_sillohuettenp.ndarray, shape=(K_sil, n_features)
Cluster centers for the silhouette-selected K. (Legacy spelling kept.)
- centers_elbownp.ndarray, shape=(K_elb, n_features)
Cluster centers for the elbow-selected K.
- If
kis not None: - cluster_labelsnp.ndarray of int, shape=(n_samples,)
Labels from the single KMeans run.
- cluster_centersnp.ndarray, shape=(k, n_features)
Cluster centers from the single KMeans run.
- If
Notes
Silhouette uses
sklearn.metrics.silhouette_scoreon the provided feature space (no precomputed distance matrix).- Elbow choice:
If
outfile_pathis provided, the decision is delegated tomdsa_tools.Viz.plot_elbow_scores(with plotting).If not, and too few points exist to fit a curve, falls back to the argmin of inertia over the available Ks.
KMeans(init='random', n_init=K, random_state=0)is used for reproducibility.The variable/name
sillohuetteis spelled as-is to preserve legacy usage.
Examples
>>> labels_sil, labels_elb, C_sil, C_elb = sa.perform_clust_opt(None, max_clusters=8, data=X) >>> labels, centers = sa.perform_clust_opt(None, data=X, k=3)
- perform_kmeans(outfile_path=None, max_clusters=None, data=None, k=None)
Run KMeans clustering on a feature matrix, either: (a) sweeping K to select optima by silhouette and elbow criteria, or (b) fitting once at a fixed K.
- Parameters:
- outfile_pathstr or pathlib.Path
Directory where per-K label arrays are saved (via
np.save) when sweeping K (i.e., whenk is None). One file is written per value of K. IfNone, nothing is written to disk. Default isNone.- max_clustersint
Upper bound on the number of clusters to consider when sweeping K. Effective only when
k is None. The exact K range is determined byperform_clust_opt()(typically2..max_clustersinclusive). Default is10.- dataarray-like of shape=(n_samples, n_features)
Feature matrix to cluster. If
None, usesself.feature_matrix.- kint
If provided, fit a single KMeans model with exactly
kclusters and return its labels and centers. IfNone, perform a sweep over K and return the best solutions under the silhouette and elbow criteria.
- Returns:
- cluster_labelsndarray of shape=(n_samples,), dtype=int
(Only when
kis notNone) Cluster assignment for each sample, with labels in[0, k-1].- cluster_centersndarray of shape=(k, n_features)
(Only when
kis notNone) Centroids of the fitted model.- optimal_k_silhouette_labelsndarray of shape=(n_samples,), dtype=int
(Only when
kisNone) Labels for the K chosen by the silhouette criterion.- optimal_k_elbow_labelsndarray of shape=(n_samples,), dtype=int
(Only when
kisNone) Labels for the K chosen by the elbow (inertia) criterion.- centers_sillohuettendarray of shape=(k_silhouette, n_features)
(Only when
kisNone) Centers corresponding tooptimal_k_silhouette_labels.- centers_elbowndarray of shape=(k_elbow, n_features)
(Only when
kisNone) Centers corresponding tooptimal_k_elbow_labels.
Notes
When
kis provided, this method fitssklearn.cluster.KMeanswithinit='random',n_init=k, andrandom_state=0for reproducibility.When
kisNone, selection of the optimal K and any per-K saving behavior are delegated toperform_clust_opt(), which is expected to return the label arrays and centers for the silhouette- and elbow-selected solutions.
Examples
Fit a fixed-K model:
>>> labels, centers = obj.perform_kmeans(k=5)
Sweep K (saving per-K labels):
>>> sil_labels, elbow_labels, sil_centers, elbow_centers = ... obj.perform_kmeans(outfile_path="results/kmeans", max_clusters=12)
- perform_optimized_UMAP_global(feature_matrix=None, max_neighbors=None, min_neighbors=None, stepsize=None, min_dist_values=None)
Grid-search UMAP over n_neighbors and min_dist, and for each embedding compute Pearson r between pdist(feature space) and pdist(embedding space). Returns a tidy DataFrame for easy plotting (bubble chart etc.).
- Parameters:
- feature_matrixnp.ndarray, shape=(n_samples, n_features)
If None, use self.feature_matrix (and build it if needed).
- max_neighborsint
Exclusive upper bound for the neighbors sweep. Default 100.
- min_neighborsint
Inclusive lower bound for the neighbors sweep. Default 10.
- stepsizeint
Step between neighbor counts. Default 10 (i.e., 10, 20, 30, …).
- min_dist_valuesiterable of float
Set of UMAP min_dist values to try. Default (0.1, 0.4, 0.7, 1.0).
- Returns:
- pandas.DataFrame
Columns: - n_neighbors (int) - min_dist (float) - pearson_r (float, rounded to 2 decimals) - r01_for_bubbles (float in [0,100]): visual helper = 100*((r+1)/2)
Notes
UMAP embeddings depend on (n_neighbors, min_dist); we loop both.
Pearson is computed with pearson_sums_only (streaming, exact).
r01_for_bubbles maps r∈[-1,1] -> [0,100] for bubble size/color.
- perform_optimized_UMAP_local(feature_matrix=None, max_neighbors=None, min_neighbors=None, stepsize=None, min_dist_values=None, eval_n=None)
Grid-search UMAP over n_neighbors and min_dist, and for each embedding compute Pearson r between pdist(feature space) and pdist(embedding space). Returns a tidy DataFrame for easy plotting (bubble chart etc.).
- Parameters:
- feature_matrixnp.ndarray, shape=(n_samples, n_features)
If None, use self.feature_matrix (and build it if needed).
- max_neighborsint
Exclusive upper bound for the neighbors sweep. Default 100.
- min_neighborsint
Inclusive lower bound for the neighbors sweep. Default 10.
- stepsizeint
Step between neighbor counts. Default 10 (i.e., 10, 20, 30, …).
- min_dist_valuesiterable of float
Set of UMAP min_dist values to try. Default (0.1, 0.4, 0.7, 1.0).
- Returns:
- pandas.DataFrame
Columns: - n_neighbors (int) - min_dist (float) - pearson_r (float, rounded to 2 decimals) - r01_for_bubbles (float in [0,100]): visual helper = 100*((r+1)/2)
Notes
UMAP embeddings depend on (n_neighbors, min_dist); we loop both.
Pearson is computed with pearson_sums_only (streaming, exact).
r01_for_bubbles maps r∈[-1,1] -> [0,100] for bubble size/color.
- reduce_systems_representations(feature_matrix=None, method=None, n_components=None, min_dist=None, n_neighbors=None)
Reduce the dimensionality of a per-frame feature matrix using PCA or UMAP.
- Parameters:
- feature_matrixarray-like of shape=(n_samples, n_features)
Matrix to reduce. If omitted, uses
self.feature_matrix; if that is not set, callsreplicates_to_feature_matrix()to build it.- method{‘PCA’, ‘UMAP’}, default ‘PCA’
Dimensionality-reduction algorithm.
- n_componentsint, default 2
Target embedding dimensionality.
- min_distfloat
UMAP
min_disthyperparameter controlling how tightly points cluster (smaller ⇒ tighter clusters). Ignored for PCA. Default0.5.- n_neighborsint
UMAP neighborhood size controlling local vs global structure. Ignored for PCA. Default
900.
- Returns:
- If method == ‘PCA’:
- (X_pca, weights, explained_variance_ratio_)tuple
- X_pcandarray of shape=(n_samples, n_components)
PCA scores.
- weightsndarray of shape=(n_components, n_features)
Component loadings.
- explained_variance_ratio_ndarray of shape=(n_components,)
Fraction of variance explained by each principal component.
- If method == ‘UMAP’:
- embeddingndarray of shape=(n_samples, n_components)
Low-dimensional embedding.
Notes
PCA uses
sklearn.decomposition.PCA.UMAP uses
umap.UMAPwith the provided hyperparameters.
Examples
>>> X, W, evr = sa.reduce_systems_representations(method='PCA', n_components=2) >>> U = sa.reduce_systems_representations(method='UMAP', n_components=2)
- replicates_to_feature_matrix(arrays=None) ndarray
Construct a flattened, per-frame feature matrix from one or more systems of residue×residue adjacency matrices.
- Parameters:
- arrayslist of np.ndarray
[array_1, ..., array_m]where each array has shape(n_frames, n_residues, n_residues). IfNone, usesself.systems_representations. The only axis allowed to differ across arrays isn_frames.
- Returns:
- np.ndarray
Shape=``(sum(n_frames), n_features)``, where
n_featuresequals the number of unique residue–residue pairs from the upper triangle (excluding the diagonal) of the per-frame matrix. Each row corresponds to a single frame across all systems.
Notes
Index row/column
0are dropped ([1:, 1:]) under the assumption they hold metadata (residue indices).Only the upper triangle is used to avoid duplicate symmetric entries.
The result is cached to
self.feature_matrixfor reuse.
Examples
>>> arrays = [sys1, sys2] # each (n_frames, n_res, n_res) >>> X = sa.replicates_to_feature_matrix(arrays)
- run_PCA(feature_matrix, n)
Run Principal Components Analysis (PCA) and return transformed coordinates, component loadings, and explained variance ratios.
- Parameters:
- feature_matrixnp.ndarray
Shape=``(n_samples, n_features)``. Typically produced by
replicates_to_feature_matrix().- nint
Number of components to retain (default
2in upstream callers).
- Returns:
- X_pcanp.ndarray, shape=(n_samples, n)
Transformed coordinates in the principal-component space.
- weightsnp.ndarray, shape=(n, n_features)
PCA component loadings (rows = components, columns = original features).
- explained_variancesnp.ndarray, shape=(n,)
Fraction of variance explained by each selected component.
Notes
Thin wrapper around
sklearn.decomposition.PCA; fits on the provided feature matrix and returns the transformed coordinates, component loadings, and explained-variance ratios.Examples
>>> X_pca, W, evr = sa.run_PCA(feature_matrix, 2) >>> X_pca.shape= (feature_matrix.shape[0], 2)
- t_test_autocorrelation(results=None, m=None, alpha=None)