
Database

PostgreSQL schemas for the Trender analytics platform with a 3-layer data pipeline architecture.

Overview

The database implements a medallion architecture (Raw → Staging → Analytics) with dimensional modeling for efficient querying and reporting.

Architecture

Raw Layer (JSONB storage)
    ↓ Validation & Enrichment
Staging Layer (Cleaned data)
    ↓ Dimensional Modeling
Analytics Layer (Facts + Dimensions)
    ↓ Pre-aggregated Views
Dashboard Queries

Schema files

File                            Purpose
init.sql                        Master initialization script (imports all schema files)
schema/01_raw_layer.sql         Raw data storage (GitHub API responses)
schema/02_staging_layer.sql     Validated & cleaned data
schema/03_analytics_layer.sql   Dimensional model (facts + dimensions)
schema/04_views.sql             Pre-aggregated analytics views
schema/05_workflow_traces.sql   Workflow execution tracking

Layer breakdown

1. Raw Layer (Retention: 7 days)

Purpose: Audit trail and reprocessing capability

Tables:

raw_github_repos

Stores complete GitHub API responses in JSONB format.

CREATE TABLE raw_github_repos (
    id BIGSERIAL PRIMARY KEY,
    repo_full_name VARCHAR(255) NOT NULL,
    api_response JSONB NOT NULL,
    fetch_timestamp TIMESTAMPTZ DEFAULT NOW(),
    source_language VARCHAR(50)  -- 'Python', 'TypeScript', 'Go', 'render'
);

Use case: Reprocess data without re-fetching from GitHub
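
Because api_response is stored as JSONB, fields can be re-extracted later without another API call. A minimal sketch (stargazers_count and description are standard GitHub API response fields):

```sql
-- Re-extract selected fields from stored API responses
-- without re-fetching from GitHub.
SELECT
    repo_full_name,
    (api_response->>'stargazers_count')::INTEGER AS stars,
    api_response->>'description'                 AS description,
    fetch_timestamp
FROM raw_github_repos
WHERE fetch_timestamp > NOW() - INTERVAL '7 days'
ORDER BY fetch_timestamp DESC;
```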

2. Staging Layer (Retention: 7 days)

Purpose: ETL audit trail and data quality validation

Tables:

stg_repos_validated

Cleaned and validated repository data ready for analytics.

CREATE TABLE stg_repos_validated (
    repo_full_name VARCHAR(255) PRIMARY KEY,
    repo_url VARCHAR(500),
    language VARCHAR(50) NOT NULL,
    description TEXT,
    stars INTEGER DEFAULT 0,
    created_at TIMESTAMPTZ NOT NULL,
    updated_at TIMESTAMPTZ NOT NULL,
    readme_content TEXT,
    loaded_at TIMESTAMPTZ DEFAULT NOW()
);

Business rules applied:

  • Non-null language
  • Valid timestamps
  • Star count ≥ 0
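
The rules above could be enforced during the raw-to-staging load with a filtered insert along these lines (a sketch, not the platform's actual ETL code; the field names inside api_response follow the GitHub API):

```sql
-- Load the most recent raw response per repo into staging,
-- applying the business rules in the WHERE clause.
INSERT INTO stg_repos_validated
    (repo_full_name, repo_url, language, description, stars, created_at, updated_at)
SELECT DISTINCT ON (repo_full_name)
    repo_full_name,
    api_response->>'html_url',
    api_response->>'language',
    api_response->>'description',
    COALESCE((api_response->>'stargazers_count')::INTEGER, 0),
    (api_response->>'created_at')::TIMESTAMPTZ,
    (api_response->>'updated_at')::TIMESTAMPTZ
FROM raw_github_repos
WHERE api_response->>'language' IS NOT NULL                  -- non-null language
  AND api_response->>'created_at' IS NOT NULL                -- valid timestamps
  AND (api_response->>'stargazers_count')::INTEGER >= 0      -- star count >= 0
ORDER BY repo_full_name, fetch_timestamp DESC
ON CONFLICT (repo_full_name) DO UPDATE
    SET stars = EXCLUDED.stars,
        updated_at = EXCLUDED.updated_at,
        loaded_at = NOW();
```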

3. Analytics Layer (Retention: 30 days)

Purpose: Dimensional model for high-performance analytics

Dimension tables:

dim_repositories

Repository master data with SCD Type 2 history.

CREATE TABLE dim_repositories (
    repo_key BIGSERIAL PRIMARY KEY,
    repo_full_name VARCHAR(255) NOT NULL,
    repo_url VARCHAR(500),
    description TEXT,
    readme_content TEXT,
    language VARCHAR(50),
    created_at TIMESTAMPTZ,
    render_category VARCHAR(50),
    valid_from TIMESTAMPTZ DEFAULT NOW(),
    valid_to TIMESTAMPTZ,
    is_current BOOLEAN DEFAULT TRUE
);

SCD Type 2: Tracks changes over time (description, README updates)
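
A Type 2 change follows the usual two-step pattern: expire the current row, then insert the new version (a sketch of the pattern with a hypothetical repo, not the platform's actual load code):

```sql
-- 1. Expire the current version of a repo whose description changed
UPDATE dim_repositories
SET valid_to = NOW(), is_current = FALSE
WHERE repo_full_name = 'acme/widgets'
  AND is_current = TRUE;

-- 2. Insert the new version; valid_from and is_current take their defaults
INSERT INTO dim_repositories (repo_full_name, repo_url, description, language)
VALUES ('acme/widgets', 'https://github.com/acme/widgets',
        'Updated description', 'Python');
```

Queries that only care about the present filter on is_current = TRUE; point-in-time queries filter on valid_from/valid_to instead.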

dim_languages

Language reference data.

CREATE TABLE dim_languages (
    language_key BIGSERIAL PRIMARY KEY,
    language_name VARCHAR(50) UNIQUE NOT NULL,
    language_category VARCHAR(50),
    display_name VARCHAR(100)
);

Pre-populated values:

  • Python
  • TypeScript
  • Go
  • render (for Render ecosystem repos)

dim_render_services

Render service type reference data.

CREATE TABLE dim_render_services (
    service_key BIGSERIAL PRIMARY KEY,
    service_type VARCHAR(50) UNIQUE NOT NULL,
    service_category VARCHAR(50),
    description TEXT
);

Pre-populated values:

  • web, worker, cron, background-worker, private-service, static-site, postgres, redis

Fact tables:

fact_repo_snapshots

Daily snapshots of repository metrics and momentum scores.

CREATE TABLE fact_repo_snapshots (
    snapshot_key BIGSERIAL PRIMARY KEY,
    repo_key BIGINT REFERENCES dim_repositories(repo_key),
    language_key BIGINT REFERENCES dim_languages(language_key),
    snapshot_date DATE NOT NULL,
    stars INTEGER,
    star_velocity INTEGER,
    activity_score INTEGER,
    momentum_score NUMERIC(5, 3),
    rank_overall INTEGER,
    rank_in_language INTEGER
);

Unique constraint: (repo_key, snapshot_date)
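
The unique constraint makes daily loads idempotent: re-running a workflow for the same day can upsert instead of inserting duplicates. A sketch (the key values are hypothetical):

```sql
INSERT INTO fact_repo_snapshots
    (repo_key, language_key, snapshot_date, stars, momentum_score)
VALUES (42, 1, CURRENT_DATE, 1337, 0.875)
ON CONFLICT (repo_key, snapshot_date)
DO UPDATE SET stars = EXCLUDED.stars,
              momentum_score = EXCLUDED.momentum_score;
```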

fact_workflow_runs

Workflow execution traces for observability.

CREATE TABLE fact_workflow_runs (
    run_id VARCHAR(50) PRIMARY KEY,
    started_at TIMESTAMPTZ NOT NULL,
    completed_at TIMESTAMPTZ,
    status VARCHAR(20),
    task_tree JSONB,
    error_message TEXT,
    repos_processed INTEGER,
    execution_time_seconds NUMERIC(10, 2)
);
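
Recent failures can be pulled straight from this table, e.g. (the status value 'failed' is an assumption about the workflow's status vocabulary):

```sql
-- Most recent failed runs with their error messages
SELECT run_id, started_at, error_message, repos_processed
FROM fact_workflow_runs
WHERE status = 'failed'
ORDER BY started_at DESC
LIMIT 10;
```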

4. Analytics views

Pre-aggregated views for dashboard queries.

analytics_trending_repos_current

Top trending repos across all languages.

SELECT 
    dr.repo_full_name,
    dr.language,
    frs.stars,
    frs.momentum_score,
    dr.description,
    dr.readme_content,
    dr.created_at
FROM fact_repo_snapshots frs
JOIN dim_repositories dr ON frs.repo_key = dr.repo_key
WHERE frs.snapshot_date = CURRENT_DATE
ORDER BY frs.momentum_score DESC
LIMIT 100;

analytics_render_showcase

Render ecosystem projects.

SELECT
    dr.repo_full_name,
    dr.render_category,
    frs.stars,
    frs.momentum_score,
    frs.star_velocity,
    frs.activity_score
FROM dim_repositories dr
JOIN fact_repo_snapshots frs ON dr.repo_key = frs.repo_key
WHERE dr.language = 'render'
  AND frs.snapshot_date = CURRENT_DATE
ORDER BY frs.momentum_score DESC;

analytics_language_rankings

Per-language leaderboards with Render adoption stats.

SELECT 
    dl.language_name,
    dr.repo_full_name,
    frs.stars,
    frs.momentum_score,
    frs.rank_in_language,
    (dr.language = 'render') as uses_render
FROM fact_repo_snapshots frs
JOIN dim_repositories dr ON frs.repo_key = dr.repo_key
JOIN dim_languages dl ON frs.language_key = dl.language_key
WHERE frs.snapshot_date = CURRENT_DATE
ORDER BY dl.language_name, frs.rank_in_language;

Initialization

Option 1: Setup script (recommended)

./bin/db_setup.sh

Option 2: psql directly

cd database
psql $DATABASE_URL -f init.sql

Option 3: Individual schema files

cd database
psql $DATABASE_URL -f schema/01_raw_layer.sql
psql $DATABASE_URL -f schema/02_staging_layer.sql
psql $DATABASE_URL -f schema/03_analytics_layer.sql
psql $DATABASE_URL -f schema/04_views.sql
psql $DATABASE_URL -f schema/05_workflow_traces.sql

Verify

psql $DATABASE_URL -c "\dt"
psql $DATABASE_URL -c "\dv"

Expected:

  • 9 tables
  • 6+ views

Maintenance scripts

data_retention_cleanup.sql

Applies the tiered retention policy (runs automatically after each workflow).

Retention windows:

  • Raw layer: 7 days
  • Staging layer: 7 days
  • Analytics layer: 30 days

Manual run:

psql $DATABASE_URL -f database/data_retention_cleanup.sql
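
The cleanup amounts to deletes like the following (a simplified sketch of what data_retention_cleanup.sql does, keyed on each layer's timestamp column):

```sql
DELETE FROM raw_github_repos    WHERE fetch_timestamp < NOW() - INTERVAL '7 days';
DELETE FROM stg_repos_validated WHERE loaded_at       < NOW() - INTERVAL '7 days';
DELETE FROM fact_repo_snapshots WHERE snapshot_date   < CURRENT_DATE - 30;
```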

cleanup_workflow_tracking.sql

Migration script to remove old workflow execution tables (if upgrading).

psql $DATABASE_URL -f database/cleanup_workflow_tracking.sql

clear_data.sql

DESTRUCTIVE: Truncates all tables (for testing only).

psql $DATABASE_URL -f database/clear_data.sql

check_analytics_load.sql

Diagnostic query to verify data loading.

psql $DATABASE_URL -f database/check_analytics_load.sql

Checks:

  • Row counts per table
  • Latest snapshot dates
  • Repository distribution by language
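
Those checks boil down to queries such as (a hypothetical sketch, not the script's exact contents):

```sql
-- Row counts per layer
SELECT 'raw' AS layer, COUNT(*) FROM raw_github_repos
UNION ALL SELECT 'staging',   COUNT(*) FROM stg_repos_validated
UNION ALL SELECT 'analytics', COUNT(*) FROM fact_repo_snapshots;

-- Latest snapshot date
SELECT MAX(snapshot_date) FROM fact_repo_snapshots;

-- Repository distribution by language (current rows only)
SELECT language, COUNT(*)
FROM dim_repositories
WHERE is_current = TRUE
GROUP BY language
ORDER BY COUNT(*) DESC;
```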

Query utilities

row_counts.sql

Get row counts for all tables.

psql $DATABASE_URL -f database/row_counts.sql

repos_by_language.sql

Count repos per language.

psql $DATABASE_URL -f database/repos_by_language.sql

filter_vercel_repos.sql

Example filtering query (customize as needed).

psql $DATABASE_URL -f database/filter_vercel_repos.sql

Indexing strategy

Key indexes for performance:

-- Dimension lookups
CREATE INDEX idx_dim_repos_fullname ON dim_repositories(repo_full_name) WHERE is_current = TRUE;
CREATE INDEX idx_dim_languages_name ON dim_languages(language_name);

-- Fact table queries
CREATE INDEX idx_fact_snapshots_date ON fact_repo_snapshots(snapshot_date);
CREATE INDEX idx_fact_snapshots_momentum ON fact_repo_snapshots(momentum_score DESC);

-- Workflow traces
CREATE INDEX idx_workflow_runs_started ON fact_workflow_runs(started_at DESC);
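
The first index is partial (WHERE is_current = TRUE), so it stays small even as SCD Type 2 history accumulates. It serves the dimension's hottest lookup:

```sql
-- Served by idx_dim_repos_fullname (partial index over current rows only)
SELECT repo_key
FROM dim_repositories
WHERE repo_full_name = 'acme/widgets'
  AND is_current = TRUE;
```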

Data flow

GitHub API → workflows/workflow.py
    ↓ (store_raw_repos)
raw_github_repos
    ↓ (store_in_staging)
stg_repos_validated
    ↓ (load_to_analytics_simple)
dim_repositories
fact_repo_snapshots
    ↓ (views)
analytics_trending_repos_current
analytics_render_showcase
    ↓ (queries)
dashboard/lib/db.ts
    ↓
Next.js components

Contributing

  1. Add new schema files to schema/
  2. Update init.sql to import new files
  3. Document tables in this README
  4. Test with ./bin/db_setup.sh
  5. Add indexes for frequently queried columns