This report analyzes structural and dependency anomalies across multiple abstraction levels of the codebase.
The goal is to detect potential **software quality, design, and architecture issues** using graph-based features, anomaly detection (Isolation Forest), and SHAP explainability.

## 📚 Table of Contents

1. [Executive Overview](#1-executive-overview)
1. [Deep Dives by Abstraction Level](#2-deep-dives-by-abstraction-level)
* **Refactor hubs:** Break down god classes/utilities into smaller abstractions.
* **Mitigate bottlenecks:** Add redundancy or alternative paths.
* **Investigate outliers:** Validate if they are justified exceptions or design flaws.
* **Enforce cohesion:** Raise clustering coefficient via better modular boundaries.
* **Stabilize authorities:** Encapsulate widely used but locally weak components, reduce over-generalization, and ensure stable APIs.
* **Clarify bridges:** Validate whether cross-cluster connectors are intentional (adapters/facades) or accidental; refactor or relocate responsibilities to preserve modularity.

---

## 5. Appendix

* **Methodology:** Isolation Forest, Random Forest proxy, SHAP explanations.
* **Embedding generation:** Fast Random Projection, PCA (20–35 dims, \~0.9 target variance).
* **Clustering:** HDBSCAN tuned against Leiden communities (golden reference, AMI optimization).
* **Optimization:** Hyperparameter optimization for both the Isolation Forest and the Random Forest proxy, each scored by its F1 score.
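
As a rough illustration of the PCA bullet above, the sketch below picks the smallest number of components whose cumulative explained variance reaches a target. The function name `pick_component_count`, its parameters, and the clamping to a dimension range are assumptions made for illustration; the report does not state how the 20–35 range and the \~0.9 target are combined.

```python
from itertools import accumulate


def pick_component_count(explained_variance_ratio, target=0.9,
                         min_dims=20, max_dims=35):
    """Return the smallest k whose cumulative explained variance >= target.

    Assumption: the result is clamped to [min_dims, max_dims] and never
    exceeds the number of available components.
    """
    cumulative = list(accumulate(explained_variance_ratio))
    # First index where the running sum reaches the target; fall back to
    # all components if the target is never reached.
    k = next((i + 1 for i, total in enumerate(cumulative) if total >= target),
             len(cumulative))
    k = max(min_dims, min(k, max_dims))
    return min(k, len(cumulative))


# Hypothetical variance ratios, with a small range to keep the example short.
ratios = [0.5, 0.3, 0.15, 0.05]
chosen = pick_component_count(ratios, target=0.9, min_dims=1, max_dims=3)
```

The input list would typically come from a fitted PCA model's per-component explained variance ratios, sorted in descending order.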
| **Cluster Metrics** | Cluster characteristics | Radius, cohesion, noise | Identifies weakly defined or noisy clusters |

## 3. Plot Interpretation Guide

> **Purpose:** Provide a direct mapping between all plots and their analytical meaning.
> **Scope:** Applies to plots for *Java Type*, *Java Package*, and similar abstraction levels.
> **Format:** Each entry includes `Best for`, `Adds`, and `Why`, matching the in-report descriptions.

---

### 📘 Main Plots

| Plot | Description | Best For | Adds | Why |
|------|--------------|----------|------|-----|
| **Anomalies** | 2D visualization of all code units showing clusters and anomalies. | Understanding the overall distribution of anomalies in relation to clusters. | Context of clusters and outliers. | Reveals whether anomalies are isolated or cluster-based, guiding investigation. |
| **Global Feature Importance (SHAP Summary)** | Mean absolute SHAP values ranking global feature impact. | Global understanding of which features drive anomalies. | Direction of impact (color shows feature value). | Explains which metrics consistently influence anomaly detection. |
| **Feature Dependence (Top Important Features)** | Shows how specific feature values affect anomaly score; colored by interacting feature. | Understanding how one feature affects anomaly scores. | Color shows feature interaction or threshold effect. | Helps identify nonlinear relationships and feature interactions. |

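
The global ranking described in the SHAP summary row can be sketched in a few lines: take the mean absolute SHAP value per feature and sort descending. This is a minimal pure-Python illustration, not the report's actual pipeline; the function name and the feature names `pagerank` and `betweenness` are hypothetical.

```python
def rank_features_by_mean_abs_shap(shap_values, feature_names):
    """Rank features by mean absolute SHAP value, highest first.

    shap_values: list of rows, one per code unit, one value per feature.
    """
    row_count = len(shap_values)
    means = [
        sum(abs(row[j]) for row in shap_values) / row_count
        for j in range(len(feature_names))
    ]
    return sorted(zip(feature_names, means), key=lambda pair: pair[1],
                  reverse=True)


# Hypothetical SHAP values for two code units and two features.
ranking = rank_features_by_mean_abs_shap(
    [[0.5, -0.25], [-1.5, 0.25]],
    ["pagerank", "betweenness"],
)
```

Taking the absolute value first matters: a feature that pushes some units toward "anomalous" and others away would otherwise average out to near zero and look unimportant.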
---

### 📙 Local Explanation Plots

| Plot | Description | Best For | Adds | Why |
|------|--------------|----------|------|-----|
| **Local SHAP Force Plots (Top Anomalies 1–6)** | Visualizes per-feature contributions to each anomaly’s score relative to baseline. | Explaining *why a specific data point* is anomalous. | Visual breakdown of how each feature contributes to anomaly score. | Enables debugging of individual anomalies through transparent explanation. |

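
A force plot relies on SHAP's additivity property: the baseline plus the sum of all per-feature contributions equals the explained model output for that data point. The sketch below shows that arithmetic for one anomaly; the function name, the feature names, and the numbers are made up for illustration.

```python
def explain_anomaly(base_value, contributions):
    """Return the reconstructed score and contributions ordered by magnitude.

    contributions: mapping of feature name to its SHAP value for one unit.
    """
    ordered = sorted(contributions.items(), key=lambda item: abs(item[1]),
                     reverse=True)
    score = base_value + sum(contributions.values())
    return score, ordered


# Hypothetical baseline and contributions for a single code unit.
score, ordered = explain_anomaly(
    0.5,
    {"betweenness": -0.25, "clustering_coefficient": 0.125, "pagerank": -0.5},
)
```

Reading `ordered` from the top gives the same story a force plot tells visually: which features moved the score furthest from the baseline, and in which direction.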
---

### 📗 Cluster-Level Diagnostic Plots

| Plot | Description | Best For | Adds | Why |
|------|--------------|----------|------|-----|
| **Clusters – Overall** | Shows all clusters since they all fit into one plot. | Gaining a holistic view of cluster characteristics in the dataset. | An overall summary of how all clusters are distributed and their key metrics. | Understanding the general structure and properties of clusters can help identify patterns and potential anomalies in the data. |
| **Clusters – Largest Average Radius** | Ranks clusters by mean distance of members from their centroid. | Getting an overview of clusters that are more dispersed. | Identifies clusters with internal variability. | Large average radius suggests less cohesion and potential outliers. |
| **Clusters – Largest Max Radius** | Shows clusters with the farthest outlying member. | Identifying clusters that have members farthest from cluster center. | Highlights clusters containing extreme outliers. | Indicates clusters that may contain hidden anomalies. |
| **Clusters – Largest Size** | Displays cluster membership counts. | Understanding which clusters contain the most code units. | Provides sense of frequency of code structures. | Large clusters may represent common design patterns; small clusters are specialized. |
| **Cluster Probabilities** | Distribution of HDBSCAN membership probabilities. | Detecting code units that don’t strongly belong to any cluster. | Measures how well-defined clusters are. | Highlights noisy or weakly defined clusters. |

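
The two radius metrics in the table can be computed directly from member coordinates. The sketch below assumes Euclidean distance to the arithmetic-mean centroid; the report does not say which distance or which coordinate space (2D projection or full embedding) is used, so treat both as assumptions. The function name `cluster_radii` is hypothetical.

```python
import math


def cluster_radii(points):
    """Return (average_radius, max_radius) of points around their centroid.

    points: non-empty list of equal-length coordinate tuples.
    Assumption: Euclidean distance to the arithmetic-mean centroid.
    """
    dimensions = len(points[0])
    centroid = tuple(
        sum(point[d] for point in points) / len(points)
        for d in range(dimensions)
    )
    distances = [math.dist(point, centroid) for point in points]
    return sum(distances) / len(distances), max(distances)


# Three members at the origin and one stray member further out.
average_radius, max_radius = cluster_radii([(0, 0), (0, 0), (0, 0), (4, 0)])
```

A large gap between the two values, as in this example, points to a mostly tight cluster with one distant member, which is the situation the "Largest Max Radius" plot is meant to surface.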
---

### 📒 Cluster Noise & Bridge Diagnostics

| Plot | Description | Best For | Adds | Why |
|------|--------------|----------|------|-----|
| **Cluster Noise – Highly Central and Popular** | Central nodes that don’t fit any cluster. | Detecting code units that are highly connected but anomalous. | Reveals influential but misfit nodes. | Such nodes may be key but unstable integration points. |
| **Cluster Noise – Poorly Integrated Bridges** | Nodes connecting clusters but weakly integrated. | Detecting code units that bridge modules unusually. | Identifies cross-cutting or leaking dependencies. | May reveal architectural boundary violations. |
| **Cluster Noise – Role Inverted Bridges** | Bridges with reversed structural roles compared to expected topology. | Detecting code units connecting clusters in unexpected ways. | Highlights anomalous coupling roles. | Indicates architectural inversion or misuse of interfaces. |

---

### 📙 Feature Distribution & Relationship Plots

| Plot | Description | Best For | Adds | Why |
|------|--------------|----------|------|-----|
| **Betweenness Centrality Distribution** | Histogram of betweenness values. | Identifying code units that act as structural bridges. | Insight into flow of dependency control. | Detects potential bottlenecks or single points of failure. |
| **Clustering Coefficient Distribution** | Histogram of local clustering coefficients. | Identifying modularity and local cohesion. | Insight into how tightly code units cluster. | Reveals how cohesive or isolated different regions of the graph are. |
| **PageRank – ArticleRank Difference Distribution** | Distribution of `PageRank - ArticleRank`. | Identifying influential nodes beyond local connectivity. | Shows imbalance between influence and popularity. | Highlights components with disproportionate architectural impact. |
| **Clustering Coefficient vs PageRank** | Scatterplot comparing local clustering to global influence. | Identifying relationships between cohesion and centrality. | Visualizes trade-offs between modularity and reach. | Helps spot code units that are both locally and globally critical. |
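
For readers unfamiliar with the local clustering coefficient plotted above, here is a minimal sketch of the standard definition: the fraction of a node's neighbor pairs that are themselves connected. It treats the dependency graph as undirected, which is an assumption; the adjacency layout and the node names are hypothetical.

```python
from itertools import combinations


def local_clustering_coefficient(adjacency, node):
    """Fraction of neighbor pairs of `node` that are directly connected.

    adjacency: mapping of node to the set of its neighbors (undirected,
    so every edge appears in both directions).
    """
    neighbors = adjacency[node] - {node}
    degree = len(neighbors)
    if degree < 2:
        return 0.0  # No neighbor pairs exist, so no triangles are possible.
    linked_pairs = sum(
        1 for a, b in combinations(neighbors, 2) if b in adjacency[a]
    )
    return 2 * linked_pairs / (degree * (degree - 1))


# A triangle A-B-C with one extra unit D that depends only on A.
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
coefficient_a = local_clustering_coefficient(graph, "A")
```

Here `B` and `C` sit in a fully connected neighborhood, while `A` has a lower coefficient because `D` is not connected to its other neighbors.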