20 commits
c079b9f
feat(analytics): scaffold data_analyzer package structure
svij-sc Apr 17, 2026
3988493
feat(analytics): add DataAnalyzerConfig with YAML loading and tests
svij-sc Apr 17, 2026
cf69b38
fix(analytics): remove unused imports in config_test.py
svij-sc Apr 17, 2026
8abae4a
feat(analytics): add result type dataclasses (DegreeStats, GraphAnaly…
svij-sc Apr 17, 2026
f1c7f52
feat(analytics): add 18 SQL query templates for graph structure analysis
svij-sc Apr 17, 2026
21255d0
feat(analytics): add GraphStructureAnalyzer with 4-tier BQ validation
svij-sc Apr 17, 2026
793190c
style(analytics): apply black formatter to test files
svij-sc Apr 17, 2026
0b01b5c
feat(analytics): add report SPEC.md and initial AI-owned HTML/JS/CSS …
svij-sc Apr 17, 2026
28503d9
feat(analytics): add ReportGenerator with snapshot test
svij-sc Apr 17, 2026
018e35e
feat(analytics): add DataAnalyzer orchestrator with CLI entry point
svij-sc Apr 17, 2026
42f8d78
feat(analytics): add FeatureProfiler stub (TFDV/Dataflow integration …
svij-sc Apr 17, 2026
56eb170
fix(analytics): cast OmegaConf.to_object result in config_test
svij-sc Apr 17, 2026
7f387f6
style(analytics): apply isort and mdformat to data_analyzer files
svij-sc Apr 17, 2026
14df2b8
docs(analytics): add PRD.md for HTML report (product intent)
svij-sc Apr 17, 2026
5e166fa
docs(analytics): add BQ Data Analyzer design docs, literature review,…
svij-sc Apr 17, 2026
d3f1eb8
delete plans
svij-sc Apr 18, 2026
c2c05e2
feat(analytics): write the HTML report to disk or GCS from the orches…
svij-sc Apr 20, 2026
e67eeac
docs(analytics): add practitioner README for the analytics module
svij-sc Apr 20, 2026
40f379a
fix(analytics): address code-reviewer feedback on practitioner README
svij-sc Apr 20, 2026
826c893
tfdv
svij-sc Apr 21, 2026
188 changes: 188 additions & 0 deletions gigl/analytics/README.md
@@ -0,0 +1,188 @@
# GiGL Analytics

Pre-training graph data validation and analysis tooling. Use this module before committing to a GNN training run to
catch data quality and structural issues that silently degrade model quality.

Two subpackages:

- [`data_analyzer/`](data_analyzer/) — end-to-end `DataAnalyzer` that runs BigQuery checks and produces a single
self-contained HTML report. **Start here.**
- [`graph_validation/`](graph_validation/) — lightweight standalone validators (currently: `BQGraphValidator` for
dangling-edge checks). Use when you only need one check and not the full report.

## Quickstart

**Prerequisites.** Follow the [GiGL installation guide](../../docs/user_guide/getting_started/installation.md) so that
`uv` and GiGL's Python dependencies are available. Then authenticate to BigQuery:

```bash
gcloud auth application-default login
```

**1. Write a YAML config.** Save as `my_analyzer_config.yaml`:

```yaml
node_tables:
- bq_table: "your-project.your_dataset.user_nodes"
node_type: "user"
id_column: "user_id"
feature_columns: ["age", "country"] # optional; [] or omit if the node has no features
# label_column: "label" # optional; enables Tier 3 label checks

edge_tables:
- bq_table: "your-project.your_dataset.user_edges"
edge_type: "follows"
src_id_column: "src_user_id"
dst_id_column: "dst_user_id"

# Where to write the HTML report. Local path for quick iteration, or a gs:// URI.
output_gcs_path: "/tmp/my_analysis/"

# Optional: sizing for the neighbor-explosion estimate (fan-out per GNN layer).
fan_out: [15, 10, 5]
```
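
For intuition on `fan_out`: with `[15, 10, 5]`, each seed node can pull in up to 15 + 15×10 + 15×10×5 = 915 sampled
neighbors across three hops. A minimal sketch of that naive upper bound (illustrative only; the analyzer's actual
estimator runs in BigQuery and may differ, e.g. by weighting with observed degrees):

```python
from itertools import accumulate

fan_out = [15, 10, 5]
# Cumulative products give the maximum nodes reachable at each hop:
# [15, 150, 750]. Their sum is the naive per-seed sampling bound.
per_hop = list(accumulate(fan_out, lambda a, b: a * b))
print(per_hop, sum(per_hop))  # [15, 150, 750] 915
```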

**2. Run the analyzer.**

```bash
uv run python -m gigl.analytics.data_analyzer \
--analyzer_config_uri my_analyzer_config.yaml
```

**3. Open the report.** When the run completes:

```
[INFO] Report written to /tmp/my_analysis/report.html
```

Open the file in any browser. No server, no external dependencies, fully offline.

## What it checks

The analyzer organizes checks into four tiers. Tiers 1 and 2 always run; Tier 3 auto-enables when your config supports
it; Tier 4 is opt-in.

| Tier | When | What it checks |
| --- | --- | --- |
| **1. Hard fails** | Always | Dangling edges (NULL src/dst), referential integrity (edges pointing to nodes not in the node table), duplicate nodes. Raises `DataQualityError`; the report still renders to show partial results (see the sketch below the table). |
| **2. Core metrics** | Always | Node/edge counts, degree distribution (in/out) with percentiles, degree buckets, top-K hubs, super-hub int16 clamp count, cold-start node count, self-loops, duplicate edges, NULL rates per column, feature memory budget estimate, neighbor-explosion estimate (requires `fan_out`). |
| **3. Label + heterogeneous** | Auto when `label_column` is set on any node table, or when multiple edge types exist | Class imbalance, label coverage, edge type distribution, per-edge-type node coverage. |
| **4. Advanced** | Opt-in via config flags | Power-law exponent (implemented as a degree-stats approximation). Reciprocity, homophily, connected components, and clustering coefficient are **not yet implemented**; the flags are accepted but currently no-op. |
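
Because Tier 1 failures raise rather than return, wrap the run if your pipeline should continue past a hard fail. A
minimal sketch, hedged because the exact exception import path is not shown in this README (catching `Exception` here
stands in for `DataQualityError`):

```python
from gigl.analytics.data_analyzer import DataAnalyzer
from gigl.analytics.data_analyzer.config import load_analyzer_config

config = load_analyzer_config("my_analyzer_config.yaml")
try:
    DataAnalyzer().run(config=config)
except Exception as e:  # DataQualityError; exact import path is an assumption
    # Tier 1 hard fail: per the table above, the HTML report is still
    # written with partial results, so inspect it before re-running.
    print(f"Hard data-quality failure: {e}")
```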

The thresholds below come from a review of production GNN papers (PinSage, BLADE, LiGNN, TwHIN, AliGraph, GraphSMOTE,
Beyond Homophily, Feature Propagation, and others). See the inline citations in the threshold table for what each paper
contributes.

## Interpreting the report

The report color-codes every numeric finding. Summary of the most important thresholds:

| Metric | Green | Yellow | Red | What to do when yellow/red |
| --- | --- | --- | --- | --- |
| Dangling edges / referential integrity / duplicate nodes | 0 | — | any > 0 | Fix the input tables. Training will fail or silently corrupt otherwise. |
| Feature missing rate | < 10% | 10–90% | > 90% | Plan an imputation strategy; above ~95% the Feature Propagation phase transition (Rossi et al., ICLR 2022) hits and GNNs stop recovering signal reliably. |
| Isolated node fraction | < 1% | 1–5% | > 5% | Filter isolated nodes or densify (LiGNN, KDD 2024) for cold-start cohorts. |
| Cold-start fraction (degree 0–1) | < 5% | 5–10% | > 10% | Candidates for graph densification; also flag for special handling at serving time. |
| Super-hub int16 clamp (degree > 32,767) | 0 | — | any > 0 | GiGL silently truncates super-hub degrees in `gigl/distributed/utils/degree.py`. Either cap the hub's edges upstream or account for the truncation. |
| Degree p99 / median | < 50 | 50–100 | > 100 | Use importance sampling (PinSage, KDD 2018) or degree-adaptive neighborhoods (BLADE, WSDM 2023); degree skew is the single biggest lever in production GNNs (see the sketch after this table). |
| Class imbalance ratio | < 1:5 | 1:5 – 1:10 | > 1:10 | Message passing amplifies label imbalance 2–3× in representation space (GraphSMOTE, WSDM 2021). Consider resampling or GraphSMOTE-style synthetic nodes. |
| Edge homophily (Tier 4, future) | > 0.7 | 0.3 – 0.7 | < 0.3 | Standard GCN/GAT fail at low h (Zhu et al., NeurIPS 2020). Consider H2GCN-style architectures; below h ≈ 0.2 a plain MLP often wins. |
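
To make the "Degree p99 / median" row concrete, the same skew metric on a local degree array looks like this
(illustrative only; the analyzer computes it in BigQuery):

```python
import numpy as np

# Synthetic power-law-ish degrees standing in for per-node degree counts.
degrees = np.random.default_rng(0).zipf(2.0, size=100_000)
p99 = np.percentile(degrees, 99)
median = np.median(degrees)
skew = p99 / max(median, 1.0)  # green < 50, yellow 50-100, red > 100
print(f"p99={p99:.0f} median={median:.0f} ratio={skew:.1f}")
```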

## Advanced config

Optional YAML keys beyond the minimal quickstart:

```yaml
# Enable Tier 3 class-imbalance + label-coverage checks for a node type:
node_tables:
- bq_table: ...
label_column: "label"

# Neighbor explosion estimation — the fan-out per GNN layer you plan to train with:
fan_out: [15, 10, 5]

# Tier 4 opt-in flags. Default false.
# NOTE: Only `compute_reciprocity` is wired into the analyzer today and it logs a
# warning rather than computing a result. The other three flags are placeholders
# for future work (see "Scope and limitations" below).
compute_reciprocity: true
compute_homophily: true
compute_connected_components: true
compute_clustering: true

# Per-edge-type timestamp hint. NOTE: accepted by the config schema but not yet
# consumed by any Tier 4 query (temporal freshness check is planned).
edge_tables:
- bq_table: ...
timestamp_column: "created_at"
```
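
For heterogeneous graphs (more than one node table), each edge table must also declare `src_node_type` and
`dst_node_type`, and both must match a declared `node_type`; with a single node table they are backfilled
automatically. A sketch per the validation in `config.py` (table and column names are placeholders):

```yaml
node_tables:
  - bq_table: "your-project.your_dataset.user_nodes"
    node_type: "user"
    id_column: "user_id"
  - bq_table: "your-project.your_dataset.item_nodes"
    node_type: "item"
    id_column: "item_id"

edge_tables:
  - bq_table: "your-project.your_dataset.click_edges"
    edge_type: "clicks"
    src_id_column: "user_id"
    dst_id_column: "item_id"
    src_node_type: "user" # required when multiple node tables exist
    dst_node_type: "item"
```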

## Python API

The CLI wraps a regular class. Call from your own code when you want programmatic access to the `GraphAnalysisResult`:

```python
from gigl.analytics.data_analyzer import DataAnalyzer
from gigl.analytics.data_analyzer.config import load_analyzer_config

config = load_analyzer_config("my_analyzer_config.yaml")
analyzer = DataAnalyzer()
report_path = analyzer.run(config=config)
# report_path points to the written report.html (local path or gs:// URI)
```

The underlying `GraphStructureAnalyzer` is also callable directly if you want the raw result dataclass and no HTML:

```python
from gigl.analytics.data_analyzer.graph_structure_analyzer import GraphStructureAnalyzer

result = GraphStructureAnalyzer().analyze(config)
print(result.degree_stats)
```

See a rendered report example at
[`tests/test_assets/analytics/golden_report.html`](../../tests/test_assets/analytics/golden_report.html) to preview the
output format before authenticating to BQ.

## graph_validation

One-off validators for the subset of cases where the full analyzer is overkill. Today the only check is dangling-edge
detection:

```python
from gigl.analytics.graph_validation import BQGraphValidator

has_dangling = BQGraphValidator.does_edge_table_have_dangling_edges(
edge_table="your-project.your_dataset.user_edges",
src_node_column_name="src_user_id",
dst_node_column_name="dst_user_id",
)
```

The `DataAnalyzer` runs this check (and many more) as part of Tier 1, so prefer the full analyzer unless you
specifically need a one-line gate (e.g., inside an Airflow task or a preprocessing job). This subpackage is the intended
home for additional standalone validators in the future.
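
As a concrete example of such a gate, a minimal sketch that could serve as the body of an Airflow `PythonOperator` or
a preprocessing-job step (table and column names are placeholders):

```python
from gigl.analytics.graph_validation import BQGraphValidator

def gate_on_dangling_edges() -> None:
    # Fail the task early so training never sees a corrupt edge table.
    if BQGraphValidator.does_edge_table_have_dangling_edges(
        edge_table="your-project.your_dataset.user_edges",
        src_node_column_name="src_user_id",
        dst_node_column_name="dst_user_id",
    ):
        raise ValueError("Dangling edges detected; aborting before training.")
```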

## Scope and limitations

Current implementation status:

- **FeatureProfiler is a stub.** The class is wired in but the TFDV/Dataflow pipeline that would produce FACETS HTML per
table is deferred to a follow-up PR. Calling it today logs a warning and returns an empty `FeatureProfileResult`. The
main report is fully functional without it.
- **Tier 4 checks are partial.** Power-law exponent is computed as a degree-stats approximation. Reciprocity, homophily,
connected components, and clustering coefficient config flags are accepted but currently no-op. The `timestamp_column`
edge field is accepted but no temporal-freshness query runs yet.
- **Heterogeneous graphs: referential integrity caveat.** For each edge table, the referential-integrity check joins
against `config.node_tables[0]`. On heterogeneous graphs where different edges reference different node types, the
current implementation will under-report integrity violations; a fix is tracked for a follow-up.
- **GCS upload** works via `GcsUtils.upload_from_string` when `output_gcs_path` is a `gs://` URI, and falls back to
local filesystem write otherwise.

## Related documents

Within this module:

- [`data_analyzer/report/PRD.md`](data_analyzer/report/PRD.md) — product intent for the HTML report (AI-owned)
- [`data_analyzer/report/SPEC.md`](data_analyzer/report/SPEC.md) — technical contract for the AI-owned HTML/JS/CSS
assets
10 changes: 10 additions & 0 deletions gigl/analytics/data_analyzer/__init__.py
@@ -0,0 +1,10 @@
"""
BQ Data Analyzer for pre-training graph data analysis.

Produces a single HTML report covering data quality, feature distributions,
and graph structure metrics from BigQuery node/edge tables.
"""

from gigl.analytics.data_analyzer.data_analyzer import DataAnalyzer

__all__ = ["DataAnalyzer"]
6 changes: 6 additions & 0 deletions gigl/analytics/data_analyzer/__main__.py
@@ -0,0 +1,6 @@
"""Entry point for running the BQ Data Analyzer as a module: python -m gigl.analytics.data_analyzer."""

from gigl.analytics.data_analyzer.data_analyzer import main

if __name__ == "__main__":
main()
177 changes: 177 additions & 0 deletions gigl/analytics/data_analyzer/config.py
@@ -0,0 +1,177 @@
import re
from dataclasses import dataclass, field
from typing import Optional

from omegaconf import MISSING, OmegaConf

from gigl.common.logger import Logger

logger = Logger()

# BigQuery identifier regexes used to reject configs that would be interpolated
# directly into SQL. See https://cloud.google.com/bigquery/docs/reference/standard-sql/lexical
# for the allowed grammar. Tables are of the form project.dataset.table;
# columns are simple unquoted identifiers.
_BQ_TABLE_REGEX = re.compile(r"^[A-Za-z0-9_.\-]+\.[A-Za-z0-9_\-]+\.[A-Za-z0-9_$\-]+$")
_BQ_COLUMN_REGEX = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _validate_bq_table(name: str, field_label: str) -> None:
if not _BQ_TABLE_REGEX.fullmatch(name):
raise ValueError(
f"{field_label}={name!r} is not a valid BigQuery table reference. "
f"Expected project.dataset.table with no backticks, whitespace, or quotes."
)


def _validate_bq_column(name: str, field_label: str) -> None:
if not _BQ_COLUMN_REGEX.fullmatch(name):
raise ValueError(
f"{field_label}={name!r} is not a valid BigQuery column identifier. "
f"Expected [A-Za-z_][A-Za-z0-9_]* with no backticks, whitespace, or quotes."
)


@dataclass
class NodeTableSpec:
"""Specification for a node table in BigQuery."""

bq_table: str = MISSING
node_type: str = MISSING
id_column: str = MISSING
feature_columns: list[str] = field(default_factory=list)
label_column: Optional[str] = None


@dataclass
class EdgeTableSpec:
"""Specification for an edge table in BigQuery.

For heterogeneous graphs (more than one node table), src_node_type and
dst_node_type must be set to the node_type of the matching node table.
For homogeneous graphs (single node table) they default to that node_type.
"""

bq_table: str = MISSING
edge_type: str = MISSING
src_id_column: str = MISSING
dst_id_column: str = MISSING
src_node_type: Optional[str] = None
dst_node_type: Optional[str] = None
feature_columns: list[str] = field(default_factory=list)
timestamp_column: Optional[str] = None


@dataclass
class DataAnalyzerConfig:
"""Configuration for the BQ Data Analyzer.

Parsed from YAML via OmegaConf.

Example:
>>> config = load_analyzer_config("gs://bucket/config.yaml")
>>> config.node_tables[0].bq_table
'project.dataset.user_nodes'
"""

node_tables: list[NodeTableSpec] = MISSING
edge_tables: list[EdgeTableSpec] = MISSING
output_gcs_path: str = MISSING
fan_out: Optional[list[int]] = None
compute_reciprocity: bool = False
compute_homophily: bool = False
compute_connected_components: bool = False
compute_clustering: bool = False


def _validate_and_backfill(config: DataAnalyzerConfig) -> None:
"""Run identifier validation and backfill default node-type references.

- Every bq_table must match project.dataset.table.
- Every id_column / src_id_column / dst_id_column / feature_column /
label_column / timestamp_column must be a bare BQ identifier.
- For homogeneous configs, an edge table with no src_node_type /
dst_node_type inherits the single node table's node_type.
- For heterogeneous configs, every edge table must explicitly declare
src_node_type and dst_node_type, and both must resolve to a known
node_type.
"""
known_node_types = {nt.node_type for nt in config.node_tables}
single_node_type: Optional[str] = (
next(iter(known_node_types)) if len(config.node_tables) == 1 else None
)

for node_table in config.node_tables:
_validate_bq_table(node_table.bq_table, "node_tables.bq_table")
_validate_bq_column(node_table.id_column, "node_tables.id_column")
for col in node_table.feature_columns:
_validate_bq_column(col, "node_tables.feature_columns")
if node_table.label_column is not None:
_validate_bq_column(node_table.label_column, "node_tables.label_column")

for edge_table in config.edge_tables:
_validate_bq_table(edge_table.bq_table, "edge_tables.bq_table")
_validate_bq_column(edge_table.src_id_column, "edge_tables.src_id_column")
_validate_bq_column(edge_table.dst_id_column, "edge_tables.dst_id_column")
for col in edge_table.feature_columns:
_validate_bq_column(col, "edge_tables.feature_columns")
if edge_table.timestamp_column is not None:
_validate_bq_column(
edge_table.timestamp_column, "edge_tables.timestamp_column"
)

if edge_table.src_node_type is None:
if single_node_type is not None:
edge_table.src_node_type = single_node_type
else:
raise ValueError(
f"edge_type={edge_table.edge_type}: src_node_type is required "
f"when there are multiple node tables"
)
if edge_table.dst_node_type is None:
if single_node_type is not None:
edge_table.dst_node_type = single_node_type
else:
raise ValueError(
f"edge_type={edge_table.edge_type}: dst_node_type is required "
f"when there are multiple node tables"
)
if edge_table.src_node_type not in known_node_types:
raise ValueError(
f"edge_type={edge_table.edge_type}: src_node_type="
f"{edge_table.src_node_type!r} is not a declared node_type. "
f"Known: {sorted(known_node_types)}"
)
if edge_table.dst_node_type not in known_node_types:
raise ValueError(
f"edge_type={edge_table.edge_type}: dst_node_type="
f"{edge_table.dst_node_type!r} is not a declared node_type. "
f"Known: {sorted(known_node_types)}"
)


def load_analyzer_config(config_path: str) -> DataAnalyzerConfig:
"""Load and validate a DataAnalyzerConfig from a YAML file.

Args:
config_path: Local file path or GCS URI to the YAML config.

Returns:
Validated DataAnalyzerConfig instance with node-type references
backfilled on edge tables.

Raises:
omegaconf.errors.MissingMandatoryValue: If required fields are missing.
ValueError: If any bq_table or column name is not a valid BigQuery
identifier, or if a heterogeneous config is missing a required
src_node_type / dst_node_type.
"""
raw = OmegaConf.load(config_path)
merged = OmegaConf.merge(OmegaConf.structured(DataAnalyzerConfig), raw)
config: DataAnalyzerConfig = OmegaConf.to_object(merged) # type: ignore
_validate_and_backfill(config)
logger.info(
f"Loaded analyzer config with {len(config.node_tables)} node tables "
f"and {len(config.edge_tables)} edge tables"
)
return config