20 commits
c079b9f
feat(analytics): scaffold data_analyzer package structure
svij-sc Apr 17, 2026
3988493
feat(analytics): add DataAnalyzerConfig with YAML loading and tests
svij-sc Apr 17, 2026
cf69b38
fix(analytics): remove unused imports in config_test.py
svij-sc Apr 17, 2026
8abae4a
feat(analytics): add result type dataclasses (DegreeStats, GraphAnaly…
svij-sc Apr 17, 2026
f1c7f52
feat(analytics): add 18 SQL query templates for graph structure analysis
svij-sc Apr 17, 2026
21255d0
feat(analytics): add GraphStructureAnalyzer with 4-tier BQ validation
svij-sc Apr 17, 2026
793190c
style(analytics): apply black formatter to test files
svij-sc Apr 17, 2026
0b01b5c
feat(analytics): add report SPEC.md and initial AI-owned HTML/JS/CSS …
svij-sc Apr 17, 2026
28503d9
feat(analytics): add ReportGenerator with snapshot test
svij-sc Apr 17, 2026
018e35e
feat(analytics): add DataAnalyzer orchestrator with CLI entry point
svij-sc Apr 17, 2026
42f8d78
feat(analytics): add FeatureProfiler stub (TFDV/Dataflow integration …
svij-sc Apr 17, 2026
56eb170
fix(analytics): cast OmegaConf.to_object result in config_test
svij-sc Apr 17, 2026
7f387f6
style(analytics): apply isort and mdformat to data_analyzer files
svij-sc Apr 17, 2026
14df2b8
docs(analytics): add PRD.md for HTML report (product intent)
svij-sc Apr 17, 2026
5e166fa
docs(analytics): add BQ Data Analyzer design docs, literature review,…
svij-sc Apr 17, 2026
d3f1eb8
delete plans
svij-sc Apr 18, 2026
c2c05e2
feat(analytics): write the HTML report to disk or GCS from the orches…
svij-sc Apr 20, 2026
e67eeac
docs(analytics): add practitioner README for the analytics module
svij-sc Apr 20, 2026
40f379a
fix(analytics): address code-reviewer feedback on practitioner README
svij-sc Apr 20, 2026
826c893
tfdv
svij-sc Apr 21, 2026
188 changes: 188 additions & 0 deletions gigl/analytics/README.md
@@ -0,0 +1,188 @@
# GiGL Analytics

Pre-training graph data validation and analysis tooling. Use this module before committing to a GNN training run to
catch data quality and structural issues that silently degrade model quality.

Two subpackages:

- [`data_analyzer/`](data_analyzer/) — end-to-end `DataAnalyzer` that runs BigQuery checks and produces a single
self-contained HTML report. **Start here.**
- [`graph_validation/`](graph_validation/) — lightweight standalone validators (currently: `BQGraphValidator` for
dangling-edge checks). Use when you only need one check and not the full report.

## Quickstart

**Prerequisites.** Follow the [GiGL installation guide](../../docs/user_guide/getting_started/installation.md) so that
`uv` and GiGL's Python dependencies are available. Then authenticate to BigQuery:

```bash
gcloud auth application-default login
```

**1. Write a YAML config.** Save as `my_analyzer_config.yaml`:

```yaml
node_tables:
- bq_table: "your-project.your_dataset.user_nodes"
node_type: "user"
id_column: "user_id"
feature_columns: ["age", "country"] # optional; [] or omit if the node has no features
# label_column: "label" # optional; enables Tier 3 label checks

edge_tables:
- bq_table: "your-project.your_dataset.user_edges"
edge_type: "follows"
src_id_column: "src_user_id"
dst_id_column: "dst_user_id"

# Where to write the HTML report. Local path for quick iteration, or a gs:// URI.
output_gcs_path: "/tmp/my_analysis/"

# Optional: sizing for the neighbor-explosion estimate (fan-out per GNN layer).
fan_out: [15, 10, 5]
```
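
For intuition on `fan_out`: with `[15, 10, 5]`, each seed node can pull in up to 15 + 15×10 + 15×10×5 = 915 sampled
neighbors across three hops. A minimal sketch of that naive upper bound (illustrative only; the analyzer's actual
estimator runs in BigQuery and may differ, e.g. by weighting with observed degrees):

```python
from itertools import accumulate

fan_out = [15, 10, 5]
# Cumulative products give the maximum nodes reachable at each hop:
# [15, 150, 750]. Their sum is the naive per-seed sampling bound.
per_hop = list(accumulate(fan_out, lambda a, b: a * b))
print(per_hop, sum(per_hop))  # [15, 150, 750] 915
```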

**2. Run the analyzer.**

```bash
uv run python -m gigl.analytics.data_analyzer \
--analyzer_config_uri my_analyzer_config.yaml
```

**3. Open the report.** When the run completes:

```
[INFO] Report written to /tmp/my_analysis/report.html
```

Open the file in any browser. No server, no external dependencies, fully offline.

## What it checks

The analyzer organizes checks into four tiers. Tiers 1 and 2 always run; Tier 3 auto-enables when your config supports
it; Tier 4 is opt-in.

| Tier | When | What it checks |
| --- | --- | --- |
| **1. Hard fails** | Always | Dangling edges (NULL src/dst), referential integrity (edges pointing to nodes not in the node table), duplicate nodes. Raises `DataQualityError`; the report still renders to show partial results (see the sketch below the table). |
| **2. Core metrics** | Always | Node/edge counts, degree distribution (in/out) with percentiles, degree buckets, top-K hubs, super-hub int16 clamp count, cold-start node count, self-loops, duplicate edges, NULL rates per column, feature memory budget estimate, neighbor-explosion estimate (requires `fan_out`). |
| **3. Label + heterogeneous** | Auto when `label_column` is set on any node table, or when multiple edge types exist | Class imbalance, label coverage, edge type distribution, per-edge-type node coverage. |
| **4. Advanced** | Opt-in via config flags | Power-law exponent (implemented as a degree-stats approximation). Reciprocity, homophily, connected components, and clustering coefficient are **not yet implemented**; the flags are accepted but currently no-op. |
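
Because Tier 1 failures raise rather than return, wrap the run if your pipeline should continue past a hard fail. A
minimal sketch, hedged because the exact exception import path is not shown in this README (catching `Exception` here
stands in for `DataQualityError`):

```python
from gigl.analytics.data_analyzer import DataAnalyzer
from gigl.analytics.data_analyzer.config import load_analyzer_config

config = load_analyzer_config("my_analyzer_config.yaml")
try:
    DataAnalyzer().run(config=config)
except Exception as e:  # DataQualityError; exact import path is an assumption
    # Tier 1 hard fail: per the table above, the HTML report is still
    # written with partial results, so inspect it before re-running.
    print(f"Hard data-quality failure: {e}")
```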

The thresholds below come from a review of production GNN papers (PinSage, BLADE, LiGNN, TwHIN, AliGraph, GraphSMOTE,
Beyond Homophily, Feature Propagation, and others). See the inline citations in the threshold table for what each paper
contributes.

## Interpreting the report

The report color-codes every numeric finding. Summary of the most important thresholds:

| Metric | Green | Yellow | Red | What to do when yellow/red |
| --- | --- | --- | --- | --- |
| Dangling edges / referential integrity / duplicate nodes | 0 | — | any > 0 | Fix the input tables. Training will fail or silently corrupt otherwise. |
| Feature missing rate | < 10% | 10–90% | > 90% | Plan an imputation strategy; above ~95% the Feature Propagation phase transition (Rossi et al., ICLR 2022) hits and GNNs stop recovering signal reliably. |
| Isolated node fraction | < 1% | 1–5% | > 5% | Filter isolated nodes or densify (LiGNN, KDD 2024) for cold-start cohorts. |
| Cold-start fraction (degree 0–1) | < 5% | 5–10% | > 10% | Candidates for graph densification; also flag for special handling at serving time. |
| Super-hub int16 clamp (degree > 32,767) | 0 | — | any > 0 | GiGL silently truncates super-hub degrees in `gigl/distributed/utils/degree.py`. Either cap the hub's edges upstream or account for the truncation. |
| Degree p99 / median | < 50 | 50–100 | > 100 | Use importance sampling (PinSage, KDD 2018) or degree-adaptive neighborhoods (BLADE, WSDM 2023); degree skew is the single biggest lever in production GNNs (see the sketch after this table). |
| Class imbalance ratio | < 1:5 | 1:5 – 1:10 | > 1:10 | Message passing amplifies label imbalance 2–3× in representation space (GraphSMOTE, WSDM 2021). Consider resampling or GraphSMOTE-style synthetic nodes. |
| Edge homophily (Tier 4, future) | > 0.7 | 0.3 – 0.7 | < 0.3 | Standard GCN/GAT fail at low h (Zhu et al., NeurIPS 2020). Consider H2GCN-style architectures; below h ≈ 0.2 a plain MLP often wins. |
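
To make the "Degree p99 / median" row concrete, the same skew metric on a local degree array looks like this
(illustrative only; the analyzer computes it in BigQuery):

```python
import numpy as np

# Synthetic power-law-ish degrees standing in for per-node degree counts.
degrees = np.random.default_rng(0).zipf(2.0, size=100_000)
p99 = np.percentile(degrees, 99)
median = np.median(degrees)
skew = p99 / max(median, 1.0)  # green < 50, yellow 50-100, red > 100
print(f"p99={p99:.0f} median={median:.0f} ratio={skew:.1f}")
```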

## Advanced config

Optional YAML keys beyond the minimal quickstart:

```yaml
# Enable Tier 3 class-imbalance + label-coverage checks for a node type:
node_tables:
- bq_table: ...
label_column: "label"

# Neighbor explosion estimation — the fan-out per GNN layer you plan to train with:
fan_out: [15, 10, 5]

# Tier 4 opt-in flags. Default false.
# NOTE: Only `compute_reciprocity` is wired into the analyzer today and it logs a
# warning rather than computing a result. The other three flags are placeholders
# for future work (see "Scope and limitations" below).
compute_reciprocity: true
compute_homophily: true
compute_connected_components: true
compute_clustering: true

# Per-edge-type timestamp hint. NOTE: accepted by the config schema but not yet
# consumed by any Tier 4 query (temporal freshness check is planned).
edge_tables:
- bq_table: ...
timestamp_column: "created_at"
```
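
For heterogeneous graphs (more than one node table), each edge table must also declare `src_node_type` and
`dst_node_type`, and both must match a declared `node_type`; with a single node table they are backfilled
automatically. A sketch per the validation in `config.py` (table and column names are placeholders):

```yaml
node_tables:
  - bq_table: "your-project.your_dataset.user_nodes"
    node_type: "user"
    id_column: "user_id"
  - bq_table: "your-project.your_dataset.item_nodes"
    node_type: "item"
    id_column: "item_id"

edge_tables:
  - bq_table: "your-project.your_dataset.click_edges"
    edge_type: "clicks"
    src_id_column: "user_id"
    dst_id_column: "item_id"
    src_node_type: "user" # required when multiple node tables exist
    dst_node_type: "item"
```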

## Python API

The CLI wraps a regular class. Call from your own code when you want programmatic access to the `GraphAnalysisResult`:

```python
from gigl.analytics.data_analyzer import DataAnalyzer
from gigl.analytics.data_analyzer.config import load_analyzer_config

config = load_analyzer_config("my_analyzer_config.yaml")
analyzer = DataAnalyzer()
report_path = analyzer.run(config=config)
# report_path points to the written report.html (local path or gs:// URI)
```

The underlying `GraphStructureAnalyzer` is also callable directly if you want the raw result dataclass and no HTML:

```python
from gigl.analytics.data_analyzer.graph_structure_analyzer import GraphStructureAnalyzer

result = GraphStructureAnalyzer().analyze(config)
print(result.degree_stats)
```

See a rendered report example at
[`tests/test_assets/analytics/golden_report.html`](../../tests/test_assets/analytics/golden_report.html) to preview the
output format before authenticating to BQ.

## graph_validation

One-off validators for the subset of cases where the full analyzer is overkill. Today the only check is dangling-edge
detection:

```python
from gigl.analytics.graph_validation import BQGraphValidator

has_dangling = BQGraphValidator.does_edge_table_have_dangling_edges(
edge_table="your-project.your_dataset.user_edges",
src_node_column_name="src_user_id",
dst_node_column_name="dst_user_id",
)
```

The `DataAnalyzer` runs this check (and many more) as part of Tier 1, so prefer the full analyzer unless you
specifically need a one-line gate (e.g., inside an Airflow task or a preprocessing job). This subpackage is the intended
home for additional standalone validators in the future.
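
As a concrete example of such a gate, a minimal sketch that could serve as the body of an Airflow `PythonOperator` or
a preprocessing-job step (table and column names are placeholders):

```python
from gigl.analytics.graph_validation import BQGraphValidator

def gate_on_dangling_edges() -> None:
    # Fail the task early so training never sees a corrupt edge table.
    if BQGraphValidator.does_edge_table_have_dangling_edges(
        edge_table="your-project.your_dataset.user_edges",
        src_node_column_name="src_user_id",
        dst_node_column_name="dst_user_id",
    ):
        raise ValueError("Dangling edges detected; aborting before training.")
```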

## Scope and limitations

Current implementation status:

- **FeatureProfiler is a stub.** The class is wired in but the TFDV/Dataflow pipeline that would produce FACETS HTML per
table is deferred to a follow-up PR. Calling it today logs a warning and returns an empty `FeatureProfileResult`. The
main report is fully functional without it.
- **Tier 4 checks are partial.** Power-law exponent is computed as a degree-stats approximation. Reciprocity, homophily,
connected components, and clustering coefficient config flags are accepted but currently no-op. The `timestamp_column`
edge field is accepted but no temporal-freshness query runs yet.
- **Heterogeneous graphs: referential integrity caveat.** For each edge table, the referential-integrity check joins
against `config.node_tables[0]`. On heterogeneous graphs where different edges reference different node types, the
current implementation will under-report integrity violations; a fix is tracked for a follow-up.
- **GCS upload** works via `GcsUtils.upload_from_string` when `output_gcs_path` is a `gs://` URI, and falls back to
local filesystem write otherwise.

## Related documents

Within this module:

- [`data_analyzer/report/PRD.md`](data_analyzer/report/PRD.md) — product intent for the HTML report (AI-owned)
- [`data_analyzer/report/SPEC.md`](data_analyzer/report/SPEC.md) — technical contract for the AI-owned HTML/JS/CSS
assets
10 changes: 10 additions & 0 deletions gigl/analytics/data_analyzer/__init__.py
@@ -0,0 +1,10 @@
"""
BQ Data Analyzer for pre-training graph data analysis.

Produces a single HTML report covering data quality, feature distributions,
and graph structure metrics from BigQuery node/edge tables.
"""

from gigl.analytics.data_analyzer.data_analyzer import DataAnalyzer

__all__ = ["DataAnalyzer"]
6 changes: 6 additions & 0 deletions gigl/analytics/data_analyzer/__main__.py
@@ -0,0 +1,6 @@
"""Entry point for running the BQ Data Analyzer as a module: python -m gigl.analytics.data_analyzer."""

from gigl.analytics.data_analyzer.data_analyzer import main

if __name__ == "__main__":
main()
177 changes: 177 additions & 0 deletions gigl/analytics/data_analyzer/config.py
@@ -0,0 +1,177 @@
import re
from dataclasses import dataclass, field
from typing import Optional

from omegaconf import MISSING, OmegaConf

from gigl.common.logger import Logger

logger = Logger()

# BigQuery identifier regexes used to reject configs that would be interpolated
# directly into SQL. See https://cloud.google.com/bigquery/docs/reference/standard-sql/lexical
# for the allowed grammar. Tables are of the form project.dataset.table;
# columns are simple unquoted identifiers.
_BQ_TABLE_REGEX = re.compile(r"^[A-Za-z0-9_.\-]+\.[A-Za-z0-9_\-]+\.[A-Za-z0-9_$\-]+$")
_BQ_COLUMN_REGEX = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _validate_bq_table(name: str, field_label: str) -> None:
if not _BQ_TABLE_REGEX.fullmatch(name):
raise ValueError(
f"{field_label}={name!r} is not a valid BigQuery table reference. "
f"Expected project.dataset.table with no backticks, whitespace, or quotes."
)


def _validate_bq_column(name: str, field_label: str) -> None:
if not _BQ_COLUMN_REGEX.fullmatch(name):
raise ValueError(
f"{field_label}={name!r} is not a valid BigQuery column identifier. "
f"Expected [A-Za-z_][A-Za-z0-9_]* with no backticks, whitespace, or quotes."
)


@dataclass
class NodeTableSpec:
"""Specification for a node table in BigQuery."""

bq_table: str = MISSING
node_type: str = MISSING
id_column: str = MISSING
feature_columns: list[str] = field(default_factory=list)
label_column: Optional[str] = None


@dataclass
class EdgeTableSpec:
"""Specification for an edge table in BigQuery.

For heterogeneous graphs (more than one node table), src_node_type and
dst_node_type must be set to the node_type of the matching node table.
For homogeneous graphs (single node table) they default to that node_type.
"""

bq_table: str = MISSING
edge_type: str = MISSING
src_id_column: str = MISSING
dst_id_column: str = MISSING
src_node_type: Optional[str] = None
dst_node_type: Optional[str] = None
feature_columns: list[str] = field(default_factory=list)
timestamp_column: Optional[str] = None


@dataclass
class DataAnalyzerConfig:
"""Configuration for the BQ Data Analyzer.

Parsed from YAML via OmegaConf.

Example:
>>> config = load_analyzer_config("gs://bucket/config.yaml")
>>> config.node_tables[0].bq_table
'project.dataset.user_nodes'
"""

node_tables: list[NodeTableSpec] = MISSING
edge_tables: list[EdgeTableSpec] = MISSING
output_gcs_path: str = MISSING
fan_out: Optional[list[int]] = None
compute_reciprocity: bool = False
compute_homophily: bool = False
compute_connected_components: bool = False
compute_clustering: bool = False


def _validate_and_backfill(config: DataAnalyzerConfig) -> None:
"""Run identifier validation and backfill default node-type references.

- Every bq_table must match project.dataset.table.
- Every id_column / src_id_column / dst_id_column / feature_column /
label_column / timestamp_column must be a bare BQ identifier.
- For homogeneous configs, an edge table with no src_node_type /
dst_node_type inherits the single node table's node_type.
- For heterogeneous configs, every edge table must explicitly declare
src_node_type and dst_node_type, and both must resolve to a known
node_type.
"""
known_node_types = {nt.node_type for nt in config.node_tables}
single_node_type: Optional[str] = (
next(iter(known_node_types)) if len(config.node_tables) == 1 else None
)

for node_table in config.node_tables:
_validate_bq_table(node_table.bq_table, "node_tables.bq_table")
_validate_bq_column(node_table.id_column, "node_tables.id_column")
for col in node_table.feature_columns:
_validate_bq_column(col, "node_tables.feature_columns")
if node_table.label_column is not None:
_validate_bq_column(node_table.label_column, "node_tables.label_column")

for edge_table in config.edge_tables:
_validate_bq_table(edge_table.bq_table, "edge_tables.bq_table")
_validate_bq_column(edge_table.src_id_column, "edge_tables.src_id_column")
_validate_bq_column(edge_table.dst_id_column, "edge_tables.dst_id_column")
for col in edge_table.feature_columns:
_validate_bq_column(col, "edge_tables.feature_columns")
if edge_table.timestamp_column is not None:
_validate_bq_column(
edge_table.timestamp_column, "edge_tables.timestamp_column"
)

if edge_table.src_node_type is None:
if single_node_type is not None:
edge_table.src_node_type = single_node_type
else:
raise ValueError(
f"edge_type={edge_table.edge_type}: src_node_type is required "
f"when there are multiple node tables"
)
if edge_table.dst_node_type is None:
if single_node_type is not None:
edge_table.dst_node_type = single_node_type
else:
raise ValueError(
f"edge_type={edge_table.edge_type}: dst_node_type is required "
f"when there are multiple node tables"
)
if edge_table.src_node_type not in known_node_types:
raise ValueError(
f"edge_type={edge_table.edge_type}: src_node_type="
f"{edge_table.src_node_type!r} is not a declared node_type. "
f"Known: {sorted(known_node_types)}"
)
if edge_table.dst_node_type not in known_node_types:
raise ValueError(
f"edge_type={edge_table.edge_type}: dst_node_type="
f"{edge_table.dst_node_type!r} is not a declared node_type. "
f"Known: {sorted(known_node_types)}"
)


def load_analyzer_config(config_path: str) -> DataAnalyzerConfig:
"""Load and validate a DataAnalyzerConfig from a YAML file.

Args:
config_path: Local file path or GCS URI to the YAML config.

Returns:
Validated DataAnalyzerConfig instance with node-type references
backfilled on edge tables.

Raises:
omegaconf.errors.MissingMandatoryValue: If required fields are missing.
ValueError: If any bq_table or column name is not a valid BigQuery
identifier, or if a heterogeneous config is missing a required
src_node_type / dst_node_type.
"""
raw = OmegaConf.load(config_path)
merged = OmegaConf.merge(OmegaConf.structured(DataAnalyzerConfig), raw)
config: DataAnalyzerConfig = OmegaConf.to_object(merged) # type: ignore
_validate_and_backfill(config)
logger.info(
f"Loaded analyzer config with {len(config.node_tables)} node tables "
f"and {len(config.edge_tables)} edge tables"
)
return config