feat: data lineage view — "where did this number come from" #107

@William-Hill

Context

From AASCU Intermediary feedback session (see docs/aascu_intermediary_feedback_summary.md, pain points A + D):

"The dashboard will sometimes pull from the wrong set... institutions could be missing, campuses that have submitted data may not be actually being listed."

"Full transparency on what that means, what the outputs look like, but then also where that data is being stored. So, yeah, like, full data governance, full data lineage."

PDP loses institutional trust when numbers can't be traced back to source rows. Andres explicitly flagged data being incorrectly processed at submission time. The strongest competitive differentiator we can build is provable lineage — click any number, see exactly which uploaded rows produced it.

Goal

Every aggregate number in the dashboard is traceable to (a) the source rows, (b) the upload event that introduced them, (c) any transformations applied, and (d) timestamps for each step.
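One way to picture the goal is as a record attached to each aggregate value. The sketch below is illustrative only: the class and field names (`LineageRecord`, `UploadEvent`, `TransformationStep`, `is_fully_traceable`) are assumptions made for this example, not an existing schema in the codebase.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class UploadEvent:
    # (b) the upload event that introduced the source rows, with (d) its timestamp
    upload_id: str
    uploader: str
    uploaded_at: datetime


@dataclass(frozen=True)
class TransformationStep:
    # (c) one transformation applied, with (d) its timestamp
    step_id: str
    description: str
    applied_at: datetime


@dataclass
class LineageRecord:
    # Hypothetical container tying one dashboard number to its provenance
    metric_id: str
    value: float
    source_row_ids: list[int]  # (a) the source rows
    upload_events: list[UploadEvent]
    transformations: list[TransformationStep] = field(default_factory=list)

    def is_fully_traceable(self) -> bool:
        """True only when parts (a), (b), and (c) are all present."""
        return bool(
            self.source_row_ids and self.upload_events and self.transformations
        )


now = datetime(2024, 1, 15, 9, 0, tzinfo=timezone.utc)
record = LineageRecord(
    metric_id="mean_gpa",
    value=3.25,
    source_row_ids=[1, 2],
    upload_events=[UploadEvent("u-1", "registrar@example.edu", now)],
    transformations=[
        TransformationStep("s-1", "filtered to cohort=2022", now),
        TransformationStep("s-2", "aggregated mean GPA", now),
    ],
)
```

A check like `is_fully_traceable()` could back an acceptance test asserting that no dashboard value ships without all parts of its lineage.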

Scope

  • Click any KPI / chart bar / table cell → opens a lineage drawer
  • Drawer shows:
    • Source row IDs (paginated list)
    • Upload event ID + timestamp + uploader
    • Transformation chain (e.g., "filtered to cohort=2022", "aggregated mean GPA")
    • SQL or pipeline step IDs that produced the value
  • New table or columns to track upload-event provenance per row
  • Read-only API endpoint: GET /api/lineage?metric=<id>&filters=<...>
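The per-row provenance and drawer payload described above could be sketched as follows, using an in-memory SQLite database. Everything here is an assumption for illustration: the table names (`upload_events`, `student_rows`), the column names, the `mean_gpa_with_lineage` helper, and the shape of the returned dict are hypothetical, and the real storage layer and response format may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE upload_events (          -- hypothetical provenance table
    upload_id   INTEGER PRIMARY KEY,
    uploader    TEXT NOT NULL,
    uploaded_at TEXT NOT NULL
);
CREATE TABLE student_rows (           -- hypothetical source-data table
    row_id    INTEGER PRIMARY KEY,
    upload_id INTEGER NOT NULL REFERENCES upload_events(upload_id),
    cohort    INTEGER NOT NULL,
    gpa       REAL NOT NULL
);
""")
conn.execute(
    "INSERT INTO upload_events VALUES (1, 'registrar@example.edu', '2024-01-15T09:00:00Z')"
)
conn.executemany(
    "INSERT INTO student_rows VALUES (?, ?, ?, ?)",
    [(1, 1, 2022, 3.0), (2, 1, 2022, 3.5), (3, 1, 2021, 2.0)],
)


def mean_gpa_with_lineage(conn, cohort, page=1, page_size=50):
    """Compute mean GPA for a cohort and return it with its lineage."""
    rows = conn.execute(
        "SELECT row_id, upload_id, gpa FROM student_rows "
        "WHERE cohort = ? ORDER BY row_id",
        (cohort,),
    ).fetchall()
    if not rows:
        return None
    value = sum(gpa for _, _, gpa in rows) / len(rows)

    # Look up every upload event that contributed at least one row.
    upload_ids = sorted({upload_id for _, upload_id, _ in rows})
    placeholders = ",".join("?" * len(upload_ids))
    uploads = conn.execute(
        "SELECT upload_id, uploader, uploaded_at FROM upload_events "
        f"WHERE upload_id IN ({placeholders}) ORDER BY upload_id",
        upload_ids,
    ).fetchall()

    # Paginate the source row IDs, as the drawer would.
    start = (page - 1) * page_size
    row_ids = [row_id for row_id, _, _ in rows]
    return {
        "metric": "mean_gpa",
        "value": value,
        "source_row_ids": row_ids[start:start + page_size],
        "total_source_rows": len(row_ids),
        "upload_events": [
            {"upload_id": u, "uploader": who, "uploaded_at": ts}
            for u, who, ts in uploads
        ],
        "transformations": [f"filtered to cohort={cohort}", "aggregated mean GPA"],
    }


result = mean_gpa_with_lineage(conn, cohort=2022)
```

Storing `upload_id` on each source row keeps the lineage lookup to a single join-style query, and a read-only endpoint could return a dict of this kind serialized as JSON.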

Out of scope

Acceptance criteria

Why this is P0

This is the single highest-leverage gap from the intermediary session. It directly answers (a) the data-trust complaint, (b) the AI-governance requirement, and (c) the differentiator-vs-PDP question in one feature.
