diff --git a/.claude/skills/vai-debug/SKILL.md b/.claude/skills/vai-debug/SKILL.md
new file mode 100644
index 000000000..cb6ecb6ac
--- /dev/null
+++ b/.claude/skills/vai-debug/SKILL.md
@@ -0,0 +1,418 @@
---
name: vai-debug
description: Use when a Vertex AI custom job has failed (RPC timeout, OOM, NCCL error, silent stall, etc.) and the user gives you the numeric job id and asks you to figure out why. Downloads metadata + logs, reads the entrypoint code, reconstructs a fact-based timeline with verbatim log citations, and writes a grounded debug doc with a dataflow diagram and an actionable next-step plan.
---

# Vertex AI Distributed Job Debug

Investigate a failed Vertex AI custom job by downloading its metadata + logs, reading the entrypoint code,
reconstructing a fact-based timeline, and writing up a summary doc with citations, a dataflow diagram, and an actionable
next-step plan.

## When to use

Invoke this skill when the user gives you a Vertex AI **custom job ID** (a long numeric string like
`5602738366085857280`) and asks you to figure out why it crashed or is behaving badly. It works best for:

- distributed training/inference jobs across `workerpool0/1/2/...`
- crashes that surface as RPC timeouts, OOMs, NCCL errors, or silent stalls
- jobs where the failing replica is one of many and you need to triage

It is NOT a substitute for plain log skimming when the user has already pinpointed the bug — go directly to
`gcloud logging read` in that case.

## Inputs

`$ARGUMENTS` should contain:

1. `<job_id>` — **required**. The numeric Vertex AI custom job ID.
2. `project=<project>` — **required**. The GCP project the job ran in. If the user did not provide it, ask them
   before running anything. Do not guess, do not default to a project from a previous conversation, and do not hardcode
   any specific project in this skill.
3. `main_fn=<path>` — optional. The entrypoint file (e.g. `examples/link_prediction/heterogeneous_training.py`). If
   omitted, infer it from the job's `containerSpec.args` / `containerSpec.command` fields after step 2 below.

If the user passes a Cloud Console URL instead of a job ID, extract the ID from either form the console / GiGL may emit:

- query-param form: `?job_id=<job_id>` (older console URLs)
- path-segment form: `/training/<job_id>?project=...` — this is what GiGL itself logs (see
  `gigl/common/services/vertex_ai.py:362`), so it is the most likely paste

The project (`?project=...`) and region (`/locations/<region>/`) are also typically embedded in the URL — but still
confirm both with the user rather than parsing them silently.

## Region

If the user mentioned a region in this conversation, use that. Otherwise ask. `us-central1` is the most common region
but do not assume it without confirmation.

## Workspace

All artifacts go under `.tmp/job_<job_id>_<slug>/` in the GiGL repo root, where `<slug>` is the job's `displayName`
lowercased with non-alphanumerics replaced by `_` (e.g.
`.tmp/job_5602738366085857280_link_prediction_training_run_42/`). **Always include both the numeric job id and the
human-readable name** so the directory is greppable, listable, and recognizable by either.

`.tmp/` is gitignored at the repo root — both the directory and its contents are local-only and never committed.
`mkdir -p .tmp/job_<job_id>_<slug>/` before writing anything.
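As a concrete sketch of the slug rule above (assumptions: Python; the function name is illustrative, and collapsing each *run* of non-alphanumerics to a single `_` is one reading of "replaced by `_`"):

```python
import re

def slug_from_display_name(display_name: str) -> str:
    # Lowercase, collapse each run of non-alphanumerics to one "_", trim the edges.
    return re.sub(r"[^a-z0-9]+", "_", display_name.lower()).strip("_")

# slug_from_display_name("Link Prediction Training: Run #42")
# -> "link_prediction_training_run_42"
```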
Inside it expect to land:

- `metadata.json` — full `gcloud ai custom-jobs describe` output
- `logs.json`, `logs_mid.json`, `logs_tail.json` — sharded raw logs
- `analyze.out` — captured stdout from your reusable analysis script (the script itself lives at
  `tools/ai/<descriptive_name>.py`, see step 5)

The final write-up goes to `docs/YYYYMMDD-vai_debug_job_<job_id>_<slug>.md` (date-prefixed, snake_case, both job id +
slug per the user's naming preference). If a date-prefixed file already exists for this job, append `-2`, `-3`, etc.

## Analysis tooling

**Prefer creating well-named, reusable scripts in `tools/ai/<descriptive_name>.py` over per-job throwaways.** Examples
of good names: `tools/ai/filter_vai_logs_by_replica.py`, `tools/ai/summarize_vai_log_errors.py`,
`tools/ai/extract_first_last_log_per_rank.py`.

`tools/` is gitignored, so `tools/ai/` is **local-only** — these scripts persist on this machine across debug sessions
but are not shared via git. The goal is to build up a small local library of VAI debugging tools so the next failed job
(and the next Claude instance to debug one on this machine) can reuse them instead of rewriting from scratch.

To make this work, every script in `tools/ai/` MUST be:

- **Well-documented at the top.** A module docstring with: what it does, what inputs it takes (CLI args), what output
  format it produces, and at least one example invocation. A future session should be able to read the docstring alone
  and decide whether the script fits the current job — without re-reading the source.
- **Parameterized via `argparse`** (or similar). Take `--job-dir` / `--logs-dir` / `--filter` etc. as CLI args. Do
  **not** hardcode `.tmp/job_<job_id>_<slug>/` or any single job's specifics — the same script must work across jobs.
- **Composable.** Prefer multiple small focused scripts (one for filtering by replica, one for deduping errors, one for
  extracting timestamps per rank) over one monolithic `analyze.py`. Each script does one thing well and prints to stdout
  so the next can pipe it.

If you find logic that is genuinely one-off and won't help future jobs, scope it to a small inline section and put a
`# job-specific:` comment over it — but the rest of the script should remain reusable.

Before writing a new script, run `mkdir -p tools/ai` (the directory may not exist yet on a fresh checkout) and then
**list `tools/ai/`** and read the docstrings of existing scripts. An empty directory means no reusable tools yet — that
is fine, just create the first one. If an existing script already covers what you need, reuse it. Only create a new
script when no existing one fits.

______________________________________________________________________

## Instructions

Execute the steps below in order. **Do not skip ahead.** Each step's output feeds the next. If a step fails (e.g. the
job ID doesn't exist, logs are empty), tell the user clearly and stop — don't fabricate.

### 1. Download metadata

```bash
mkdir -p .tmp/job_<job_id>_<slug>
gcloud ai custom-jobs describe <job_id> \
  --project=<project> --region=<region> \
  --format=json > .tmp/job_<job_id>_<slug>/metadata.json
```

Note: at this exact step you don't yet have the `<slug>` (it comes from `displayName` inside the metadata). The first
time, mkdir into `.tmp/job_<job_id>/`, then `mv` to `.tmp/job_<job_id>_<slug>/` after parsing `displayName`. Or write
metadata to a temp file, parse the slug, then create the final dir.
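A minimal sketch of that temp-file flow (assumptions: `finalize_job_dir` is an illustrative name, not an existing tool; the brace-scan is one way to tolerate the non-JSON first line described below):

```python
import json
import pathlib
import re

def finalize_job_dir(job_id: str, tmp_metadata: str) -> pathlib.Path:
    raw = pathlib.Path(tmp_metadata).read_text()
    # Tolerate a non-JSON status line before the payload (see the note below).
    metadata = json.loads(raw[raw.index("{"):])
    slug = re.sub(r"[^a-z0-9]+", "_", metadata["displayName"].lower()).strip("_")
    job_dir = pathlib.Path(f".tmp/job_{job_id}_{slug}")
    job_dir.mkdir(parents=True, exist_ok=True)
    (job_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return job_dir
```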
Then extract and report:

- `displayName`, `state`, `createTime`, `startTime`, `endTime`
- `error.message` (the top-line crash reason, if any)
- For each `workerPoolSpec`: `machineSpec.machineType`, `machineSpec.acceleratorType`/`acceleratorCount`,
  `replicaCount`, `containerSpec.command`, `containerSpec.args`. **Some pools may be empty `{}` placeholders** — GiGL
  inserts an empty workerpool1 when graph-store compute has only one replica (see
  `gigl/common/services/vertex_ai.py:324`; the integration test in
  `tests/integration/common/services/vertex_ai_test.py:96` asserts this). Treat empty pools as "unused placeholder —
  skip" and do not try to dereference `machineSpec` or `containerSpec` on them.

`gcloud ai custom-jobs describe` prefixes its output with `Using endpoint [...]` — the file may have a non-JSON first
line. Strip it before `json.load` if needed (or just pipe the file through `tail -n +2` in a small script first).

State handling:

- `JOB_STATE_FAILED` / `JOB_STATE_CANCELLED`: proceed normally.
- `JOB_STATE_RUNNING`: a stuck/stalled live job is in scope ("silent stalls" per the When-to-use section). Proceed, but
  note `endTime` will be missing — use the latest log timestamp as the provisional timeline end and label the state as
  running everywhere downstream.
- `JOB_STATE_SUCCEEDED`: nothing to debug. Confirm with the user before proceeding.
- Any other state (`PENDING`, `QUEUED`, etc.): confirm with the user.

**Optional: download task and resource configs.** GiGL injects `--job_name`, `--task_config_uri`, and
`--resource_config_uri` into `containerSpec.args` (see `gigl/src/common/vertex_ai_launcher.py:254`). Trainer args like
`batch_size`, `num_neighbors`, `worker_concurrency`, etc. live in the task config — not in the job metadata (verify in
`examples/link_prediction/heterogeneous_training.py:760`). If your eventual action plan will reference any config knob,
`gsutil cp` the `--task_config_uri` and `--resource_config_uri` values into `.tmp/job_<job_id>_<slug>/` and parse them.
Skip this if the failure is clearly a runtime error (NCCL, OOM, segfault) where config knobs are not load-bearing.

### 2. Identify the entrypoint (if not provided)

If the user gave `main_fn=<path>`, use it.

Otherwise, parse `containerSpec.command` (usually `["python", "-m", "<module>"]`) and `containerSpec.args` from
each workerpool to figure out the entrypoint module. Convert the module path to a file path (replace `.` with `/`,
append `.py`).

Workerpools may have different entrypoints — record them all (compute pool vs storage pool typically differ in
distributed graph-store jobs).

### 3. Download logs (sharded)

**Always include explicit timestamp bounds derived from the metadata.** `gcloud logging read` defaults to
`--freshness=1d` when no timestamp filter is set in the query, so a job that ran more than a day ago will silently
return zero entries even though Cloud Logging has retained them (default retention is 30 days). This is the most common
silent footgun.

`gcloud logging read` returns at most `--limit` entries per call (100,000 is the hard upper bound).
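To keep the bounds consistent across shards, it can help to compute the filter string once from the metadata. A minimal sketch, assuming `metadata.json` from step 1 (the helper name and the one-hour buffer are illustrative choices, not project conventions):

```python
import datetime
import json

def bounded_log_filter(metadata_path: str, job_id: str, buffer_s: int = 3600) -> str:
    with open(metadata_path) as f:
        meta = json.load(f)
    start = meta["startTime"]  # RFC 3339, e.g. "2024-05-01T12:00:00.123456Z"
    end = meta.get("endTime")
    if end is None:
        # Still-running job: provisionally bound at "now" (see the running-state note above).
        end_dt = datetime.datetime.now(datetime.timezone.utc)
    else:
        end_dt = datetime.datetime.fromisoformat(end.replace("Z", "+00:00"))
    end_dt += datetime.timedelta(seconds=buffer_s)  # small buffer past endTime
    return (
        f'resource.type="ml_job" AND resource.labels.job_id="{job_id}" '
        f'AND timestamp>="{start}" AND timestamp<="{end_dt.isoformat()}"'
    )
```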
For long-running jobs, paginate forward by timestamp:

```bash
# First chunk — bound to the job's actual run window with a small buffer on endTime
gcloud logging read 'resource.type="ml_job" AND resource.labels.job_id="<job_id>" AND timestamp>="<startTime>" AND timestamp<="<endTime>"' \
  --project=<project> --format=json --order=asc --limit=100000 \
  > .tmp/job_<job_id>_<slug>/logs.json
```

If the job is still running (`endTime` absent), drop the upper bound — or substitute "now" — and add a generous
`--freshness` so live logs aren't cut off.

Inspect the count and last timestamp of the shard via a small `tools/ai/` script (do NOT use `python3 -c` — see
"Analysis tooling" above).

If the count came back at the limit (100000), there are more logs. Fetch the next chunk with a **strict** lower bound
(`timestamp>`, not `>=`) so the boundary entry is not re-fetched:

```bash
gcloud logging read 'resource.type="ml_job" AND resource.labels.job_id="<job_id>" AND timestamp>"<last_timestamp>" AND timestamp<="<endTime>"' \
  --project=<project> --format=json --order=asc --limit=100000 \
  > .tmp/job_<job_id>_<slug>/logs_mid.json
```

(If the Cloud Logging filter parser rejects strict `>`, fall back to `timestamp>="<last_timestamp>"` and rely on the
cross-shard `insertId` dedupe in step 5 to drop the duplicate boundary entry — but never skip the dedupe.)

Repeat with `logs_tail.json` etc. **Stop when EITHER** (a) the shard returned `count < limit`, **OR** (b) the shard
contains zero `insertId`s not already seen in earlier shards. Do NOT loop on "last timestamp past endTime" — silent
stalls and crashed jobs may have no logs anywhere near `endTime`.

Also try the alternate filter `labels."ml.googleapis.com/custom_job_id"="<job_id>"` if the first query returns 0
entries (older jobs use this label).

### 4. Read the entrypoint code (intention)

Read each entrypoint file to understand **what the job is supposed to do**. Pay attention to:

- The high-level lifecycle: setup → train → val → test → save → cleanup
- The data loaders / RPC clients being constructed
- The barriers and `dist.all_gather`/`broadcast` calls
- The shutdown path
- Any retries, timeouts, or fail-fast assertions

**Do not skim.** Read the function the user pointed at (or the inferred entrypoint) plus the helpers it calls. Note
which file:line ranges define each phase — you'll cite them in the write-up.

### 5. Read the logs (what actually happened)

Use the reusable scripts in `tools/ai/` (or create new ones following the "Analysis tooling" rules above). Do not write
a per-job `analyze.py` inside `.tmp/`.

A typical analysis pipeline needs scripts that can:

1. Load all log shards and dedupe on `insertId`

2. Sort by `timestamp`

3. Extract the message via a small helper that handles both `textPayload` and `jsonPayload.message`:

   ```python
   def extract_text(entry):
       text = entry.get("textPayload", "")
       if text:
           return text
       jp = entry.get("jsonPayload", {})
       if isinstance(jp, dict):
           return jp.get("message", str(jp))
       return str(jp)
   ```

4. Summarize, at minimum:

   - Job lifecycle events: filter on `entry["resource"]["labels"]["task_name"] == "service"`. Note this is a
     **resource label**, not a top-level field on the entry — a literal `entry.get("task_name")` will miss every
     lifecycle event.
   - First and last log timestamps per replica (to find which went silent first). Replica identity also lives in
     `resource.labels` (e.g. `task_name == "workerpool1-14"`); see the sketch below.
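     A minimal sketch, assuming the shards are already deduped and sorted ascending by `timestamp`
     (`first_last_per_replica` is an illustrative name, not an existing `tools/ai/` script):

     ```python
     def first_last_per_replica(entries):
         # entries: deduped log entries, sorted ascending by timestamp.
         seen = {}
         for e in entries:
             task = e.get("resource", {}).get("labels", {}).get("task_name", "unknown")
             ts = e["timestamp"]
             first, _ = seen.get(task, (ts, ts))
             seen[task] = (first, ts)
         # Replicas that went silent earliest sort last.
         return dict(sorted(seen.items(), key=lambda kv: kv[1][1], reverse=True))
     ```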
   - Init-phase and steady-state progress: **derive the signature strings from the entrypoint code you read in step
     4** rather than hardcoding a phrase list. Different entrypoints log different markers — e.g.
     `examples/link_prediction/heterogeneous_training.py:149` logs `finished setting up main loader` and
     `examples/link_prediction/heterogeneous_training.py:169` logs `finished setting up random negative loader`. Pick
     out the per-rank "loader ready", "model initialized", "first batch", and val-cycle markers actually emitted by
     the entrypoint(s) for this job.
   - Errors: **case-insensitive** match on
     `error|traceback|rpcerr|runtimeerror|broken future|terminated|sigterm|sigkill|oom|out of memory|killed|low on memory|econnreset|eof:|nccl|cuda error|resourceexhausted|deadline exceeded|connection reset by peer|segfault|segmentation fault|no space left|disk quota`
   - For storage/sampler jobs: `shared_sampling_scheduler` lines — how long after the first stall did they keep
     emitting `steady_state` logs?

5. **Dedupe noisy errors before printing.** Show the first occurrence of each unique error prefix, the count, and the
   time range. Do not dump 10,000 identical traceback lines.

Each of these can be its own small script in `tools/ai/`, parameterized by `--job-dir .tmp/job_<job_id>_<slug>/`. Run
them with `python tools/ai/