Skip to content

Generate Airlines Demo DB (Big Data) and Build a Flight Timetable to make a benchmark for PostgreSQL vs Doublets #11

@konard

Description

@konard

Summary

Set up the PostgresPro Airlines demo database, generate a large (“big data”) instance with the official generator, and implement flight timetable queries (departures/arrivals by airport/date/route). Use these queries to establish a repeatable benchmark and compare PostgreSQL 18 performance and ergonomics with a Doublets implementation on equivalent operations. The environment must be containerized, with two modes: default (durable) and embedded-like (WAL-light).

Scope

  • PostgreSQL side (Docker): data generation, schema inspection, timetable queries (with validity checks), basic indexes/constraints, timing runs in both modes.
  • Doublets side: functionally equivalent data model and queries (or pipelines), plus timing runs.
  • Deliver reproducible scripts, Docker assets, and result artifacts (logs/CSVs).

Deliverables

  • [docker]/docker-compose.yml pinned to postgres:18.
  • [docker]/compose.embedded.yml (override) for embedded-like mode (WAL-light).
  • [docker]/init/99_unlogged.sql (optional) to switch data tables to UNLOGGED in embedded-like runs.
  • [docs]/HOWTO.md — runbook for Docker, generation parameters, and how to run both benchmark modes.
  • [sql]/10_timetable_queries.sql — queries below.
  • [bench]/pg/run.sh — timings (EXPLAIN (ANALYZE, BUFFERS) + wall-clock), CSV output.
  • [bench]/doublets/run.* — equivalent Doublets operations, same CSV schema.
  • [bench]/schema-mapping.md — mapping Airlines entities/fields to Doublets structures.
  • README update with a “Benchmark: Timetable” section linking to the above.

Prerequisites (PostgreSQL)

  • PostgreSQL 18 (pinned) via Docker image postgres:18.
  • Disk space per chosen dataset size (3m/6m/1y/2y).
  • Extensions typically used by the generator: btree_gist, cube, earthdistance, dblink.

Containerized Environment

Durable (default) compose

# [docker]/docker-compose.yml
version: "3.9"
services:
  pg:
    image: postgres:18
    container_name: airlines-pg18
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: postgres
    ports: ["5432:5432"]
    volumes:
      - pgdata:/var/lib/postgresql/data
      - ./init:/docker-entrypoint-initdb.d
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 5s
      retries: 20
volumes:
  pgdata:

Embedded-like (WAL-light) override

This mode trades durability for speed to better match embedded DB behavior.

# [docker]/compose.embedded.yml
services:
  pg:
    command:
      - "postgres"
      - "-c"; "fsync=off"
      - "-c"; "synchronous_commit=off"
      - "-c"; "full_page_writes=off"
      - "-c"; "wal_level=minimal"

Optional (strongly recommended for WAL-free data writes): switch large tables to UNLOGGED in embedded mode.

-- [docker]/init/99_unlogged.sql (only applied when you want embedded-like runs)
DO $$
DECLARE r record;
BEGIN
  FOR r IN
    SELECT format('%I.%I', n.nspname, c.relname) AS fqname
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE n.nspname = 'bookings' AND c.relkind = 'r'
  LOOP
    EXECUTE 'ALTER TABLE '|| r.fqname || ' SET UNLOGGED';
  END LOOP;
END$$;

Revert to durable: replace SET UNLOGGED with SET LOGGED.

Run:

# Durable
docker compose -f docker/docker-compose.yml up -d

# Embedded-like (WAL-light)
docker compose -f docker/docker-compose.yml -f docker/compose.embedded.yml up -d

Setup Options

Option A — Load a ready snapshot

gunzip -c demo-YYYYMMDD-1y.sql.gz \
| docker compose -f docker/docker-compose.yml exec -T pg psql -U postgres

Recreates demo inside the container.

Option B — Generate your own “big data” (recommended)

From psql in the container (any DB, not demo yet):

\i install;  -- creates DB demo, schemas, extensions

-- Example: generate 1 year with 4 workers
CALL generate(now(), now() + interval '1 year', 4);

-- Progress
SELECT busy();

-- Post-checks
\i check.sql;

What to implement (PostgreSQL)

A. Timetable queries (using the provided view)

bookings.timetable already handles airport-local times and the temporal validity join.

-- Departures from SVO on 2025-10-07
SELECT *
FROM bookings.timetable
WHERE departure_airport = 'SVO'
  AND (scheduled_departure AT TIME ZONE 'UTC')::date = DATE '2025-10-07'
ORDER BY scheduled_departure;

-- Arrivals to SVO on 2025-10-07
SELECT *
FROM bookings.timetable
WHERE arrival_airport = 'SVO'
  AND (scheduled_arrival AT TIME ZONE 'UTC')::date = DATE '2025-10-07'
ORDER BY scheduled_arrival;

-- Next available flight after model-time “now” for a route
SELECT *
FROM bookings.timetable t
WHERE t.departure_airport = 'SVX'
  AND t.arrival_airport   = 'WUH'
  AND t.scheduled_departure > bookings.now()
ORDER BY t.scheduled_departure
LIMIT 1;

B. Manual timetable (without the view, explicit validity check)

SELECT f.flight_id,
       r.route_no,
       r.departure_airport,
       r.arrival_airport,
       f.status,
       f.scheduled_departure,
       f.scheduled_arrival
FROM   bookings.flights  AS f
JOIN   bookings.routes   AS r
  ON   r.route_no = f.route_no
 AND   r.validity @> f.scheduled_departure  -- validity at composition time
WHERE  r.departure_airport = 'SVO'
  AND  f.scheduled_departure::date = DATE '2025-10-07'
ORDER BY f.scheduled_departure;

Benchmark plan

Metrics

  • Wall-clock time (median over N runs).
  • EXPLAIN (ANALYZE, BUFFERS) for PostgreSQL queries (saved to logs).
  • Result row counts and checksum (simple hash) for equivalence.

Datasets

  • Two scales (e.g., 6 months and 1 year).

Procedure

  1. Warm-up: run each query once (discard).
  2. Execute each query 10 runs, record wall-clock ms.
  3. Store results in bench/results/*.csv with columns:
    system,durability_mode,dataset,query_id,run,rows,ms.
  4. Compute summary stats (min/median/p95) and check consistency.

Durability modes

  • durable: default compose (baseline, production-like).
  • embedded: compose override + UNLOGGED tables to eliminate WAL where possible.
    • Caveats: not crash-safe; UNLOGGED tables are truncated after crash/restart; no replication.

Doublets parity

  • Model the same entities/time attributes.
  • Return the same logical result sets as the SQL above.
  • Export CSV with the same columns (include durability_mode = embedded).

Validation

  • SELECT bookings.now(); returns a timestamp (model time).
  • Sanity: row counts by status, min/max scheduled times, airports present.
  • Validity sanity: joining routes to flights without the view must use r.validity @> f.scheduled_departure.
  • Embedded mode check: verify relpersistence='u' for bookings tables:
    SELECT relname, relpersistence
    FROM pg_class
    WHERE relnamespace = 'bookings'::regnamespace AND relkind='r';

Acceptance Criteria

  • Dockerized environment with PostgreSQL 18 (postgres:18) and reproducible startup.
  • Big demo DB available (document exact method & parameters).
  • 10_timetable_queries.sql runnable on the generated dataset.
  • Manual queries that bypass the view include the validity check.
  • Benchmarks run in both durability modes and produce CSVs at two scales.
  • README updated with a short how-to and links to scripts & results.

Nice to have (optional)

  • Timing plots from CSVs.
  • Notes on indexes that materially change plans.

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions