Kmonte/add tb for glt#603

Draft
kmontemayor2-sc wants to merge 43 commits into main from
kmonte/add-tb-for-glt

Conversation

@kmontemayor2-sc
Collaborator

Scope of work done

Where is the documentation for this feature?: N/A

Did you add automated tests or write a test plan?

Updated Changelog.md? NO

Ready for code review?: NO

kmontemayor and others added 30 commits April 28, 2026 18:24
Replace tests for the four-function API with tests for the new
TensorBoardWriter class. Tests fail with ImportError until the class
lands in the next commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code review found that test_close_on_noop_writer_does_not_raise only
called close() once, so idempotency on the no-op path was untested.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collapse resolve_tensorboard_log_dir, create_tensorboard_writer,
write_tensorboard_scalar, close_tensorboard_writer, and
VERTEX_TENSORBOARD_LOG_DIR_ENV_KEY into a single TensorBoardWriter class
with from_uri classmethod. The class is context-managerable and no-ops
when disabled, eliminating Optional plumbing at call sites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
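A minimal sketch of what such a class could look like, assuming the shape described above (context-managerable, a `from_uri` classmethod, and a no-op path when disabled). Everything beyond the names `TensorBoardWriter`, `from_uri`, `log`, and `close` is an assumption; the real class wraps a TensorBoard file writer rather than an in-memory list.

```python
from typing import Mapping, Optional


class TensorBoardWriter:
    """Sketch of a context-managed scalar writer that no-ops when disabled."""

    def __init__(self, log_dir: Optional[str]) -> None:
        # A real implementation would lazily open a summary file writer here;
        # this sketch records writes in memory to stay dependency-free.
        self._log_dir = log_dir
        self.written: list[tuple[str, float, int]] = []  # (tag, value, step)

    @classmethod
    def from_uri(cls, uri: Optional[str]) -> "TensorBoardWriter":
        # A None URI yields a no-op writer, so call sites need no Optional plumbing.
        return cls(log_dir=uri)

    @property
    def enabled(self) -> bool:
        return self._log_dir is not None

    def log(self, scalars: Mapping[str, float], step: int) -> None:
        # Dict-style API: one call replaces several back-to-back scalar writes.
        if not self.enabled:
            return
        for tag, value in scalars.items():
            self.written.append((tag, float(value), step))

    def close(self) -> None:
        # Idempotent: safe to call more than once, including on the no-op path.
        self._log_dir = None

    def __enter__(self) -> "TensorBoardWriter":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.close()
```

With this shape, a trainer can write `with TensorBoardWriter.from_uri(uri) as w: w.log({"loss": loss, "acc": acc}, step=step)` and the same code works whether or not logging is enabled.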
The V1 BaseTrainer.train body never wrote scalars via TF's ambient
default writer, so the file_writer + as_default() block in __run_training
served no purpose. Verified with grep: no tf.summary.scalar or
write_tensorboard_scalar callers anywhere in gigl/src/training/v1/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace function-based tensorboard helpers with the new
TensorBoardWriter class. Collapses two back-to-back scalar writes per
log step into a single dict-style log() call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…riter

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nsorBoardWriter

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…TensorBoardWriter

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The default for the proto bool field is false, so removing the line
preserves the same behavior. Avoids implying that tensorboard logging
should be on by default in example task configs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Relocate from gigl/src/common/utils/tensorboard.py (internal pipeline
utilities) to gigl/utils/tensorboard_writer.py (general-purpose user
utilities). Update the test file path, the in-test patch paths, and
all four example training scripts to import from the new location.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Examples now log to TensorBoard whenever a tensorboard_log_uri is set on
the gbml config; no separate boolean to flip. Removes the
should_log_to_tensorboard field from TrainingProcessArgs (and its
docstring entry, extraction from the gbml config, and pass-through to
_training_process) in all four link-prediction example training scripts.
TensorBoardWriter.from_uri already returns a no-op writer when the URI
is None, so a single signal (the URI) is sufficient.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
V1 and V2 trainer launchers previously gated tensorboard_logs_uri
construction on both should_log_to_tensorboard AND a non-empty proto
field. Drop the should_log_to_tensorboard half so the URI is forwarded
to the Vertex AI launcher whenever the proto field is set. One signal
(URI presence) instead of two.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TensorBoardWriter now reads AIP_TENSORBOARD_LOG_DIR directly and raises
when unset. Vertex AI populates that env var from
CustomJobSpec.baseOutputDirectory, which GiGL's launcher already derives
from tensorboardLogsUri — so the prior fallback in _resolve_log_dir was
dead weight.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
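The fail-fast behavior described above can be sketched as a small helper (the function name `resolve_log_dir_from_env` is hypothetical; `AIP_TENSORBOARD_LOG_DIR` is the env var Vertex AI actually sets from `CustomJobSpec.baseOutputDirectory`):

```python
import os

AIP_TENSORBOARD_LOG_DIR_ENV = "AIP_TENSORBOARD_LOG_DIR"  # set by Vertex AI


def resolve_log_dir_from_env() -> str:
    """Read the Vertex-provided log dir, raising when it is absent."""
    log_dir = os.environ.get(AIP_TENSORBOARD_LOG_DIR_ENV)
    if not log_dir:
        raise RuntimeError(
            f"{AIP_TENSORBOARD_LOG_DIR_ENV} is not set; expected Vertex AI to "
            "derive it from CustomJobSpec.baseOutputDirectory."
        )
    return log_dir
```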
…ning

Now that TensorBoardWriter reads AIP_TENSORBOARD_LOG_DIR directly via
from_env(), the example no longer needs to plumb tensorboard_logs_uri
from the GbmlConfig through TrainingProcessArgs to the worker process.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Now that TensorBoardWriter reads AIP_TENSORBOARD_LOG_DIR directly via
from_env(), the examples no longer need to plumb tensorboard_logs_uri
from the GbmlConfig through TrainingProcessArgs to the worker process.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…_name

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… set

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
kmontemayor and others added 13 commits May 4, 2026 19:07
…t_name

Vertex AI MetadataStore Context IDs (which back ExperimentRuns) must match
[a-z0-9][a-z0-9-]{0,127}. GiGL's gigl_train_<task>_<timestamp> job names
contain underscores and were rejected with a 400 from the SDK. Lowercase and
replace underscores with hyphens for the run ID; fail fast with a clear
message when the user-supplied tensorboard_experiment_name itself is invalid
(don't silently rewrite user input).
Passing experiment_run=<string> invokes aiplatform.ExperimentRun's getter,
which 404s when the run doesn't exist yet (every first submit does). Per the
SDK's own contract — "If 'experiment' is set but 'experiment_run' is not, an
ExperimentRun resource will still be auto-generated" (jobs.py:2509-2514) —
the right move is to set only experiment= and let the SDK create the run.
The user-supplied experiment name is still validated up front against
Vertex's resource-ID regex; the run-id sanitizer is no longer needed.
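The validate-vs-sanitize split above can be illustrated as follows. The regex is the one quoted in the commit message; the helper names are hypothetical, and the real code may differ in error wording:

```python
import re

# Vertex AI MetadataStore Context IDs (which back ExperimentRuns) must
# match this pattern, per the commit message above.
_VERTEX_RESOURCE_ID_RE = re.compile(r"[a-z0-9][a-z0-9-]{0,127}")


def validate_experiment_name(name: str) -> str:
    """Fail fast on an invalid user-supplied name; never silently rewrite it."""
    if not _VERTEX_RESOURCE_ID_RE.fullmatch(name):
        raise ValueError(
            f"tensorboard_experiment_name {name!r} must match "
            "[a-z0-9][a-z0-9-]{0,127} (lowercase alphanumerics and hyphens)."
        )
    return name


def sanitize_job_name(job_name: str) -> str:
    """Derive a valid run ID from a gigl_train_<task>_<timestamp> job name."""
    return job_name.lower().replace("_", "-")
```

The asymmetry is deliberate: GiGL-generated job names get rewritten to fit the regex, while user-supplied experiment names are rejected outright so the user is never surprised by a silently renamed experiment.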
…path

Vertex's auto-uploader is gated on submit(tensorboard=); submit(experiment=)
alone leaves AIP_TENSORBOARD_LOG_DIR populated but never reads from it, so
events sit in GCS un-uploaded (job 6570151780682825728 is the smoking gun).
The two submit kwargs are mutually exclusive, and tensorboard= forces a
job-scoped experiment we can't rename — neither delivers multi-run
comparison.

Switch approach: when tensorboard_experiment_name is set, the launcher
injects GIGL_TENSORBOARD_RESOURCE_NAME and GIGL_TENSORBOARD_EXPERIMENT_NAME
into the worker container, and submits with NEITHER experiment= nor
tensorboard=. The trainer's TensorBoardWriter.from_env reads those env vars
on the chief rank and starts aiplatform.start_upload_tb_log to stream
AIP_TENSORBOARD_LOG_DIR to the named TensorboardExperiment. close() pairs
with end_upload_tb_log. Multiple jobs sharing the experiment name show as
comparable runs on one TB page.

Drops the now-unused _ensure_experiment_with_backing_tb helper (the SDK
uploader auto-creates the TensorboardExperiment).
The "Open TensorBoard" link on the Vertex AI job page is gated on
jobSpec.tensorboard. Dropping that field (which the previous experiment-name
branch did) hides the link entirely. Make _submit_job set tensorboard= any
time the resource name is configured. The chief-rank uploader continues to
stream to the user-named TensorboardExperiment in parallel for cross-job
comparison.

Refresh the now-stale dual-uploader story in the dataclass docstring,
launcher comment block, and proto comment. Also adds the round-2 plan to
docs/plans/.
…runs

The chief-rank aiplatform.start_upload_tb_log uploader was watching
AIP_TENSORBOARD_LOG_DIR with no run_name_prefix. The SDK's LogdirLoader
maps root-logdir events to DEFAULT_RUN_NAME = "default", and the SDK
silently merges into existing runs by name — so two jobs sharing the same
TensorboardExperiment would collapse into one "default" run instead of
producing two comparable runs (codex round-2 issue 1).

Approach:
- Launcher injects GIGL_TENSORBOARD_RUN_NAME = sanitized(job_name) + UTC
  timestamp. The timestamp keeps run names unique per launch, even across
  reruns of the same applied_task_identifier.
- Writer uses <AIP_TENSORBOARD_LOG_DIR>/<run_name>/ as the file-writer
  directory; the uploader still watches the parent so the SDK's
  LogdirLoader discovers the subdir as a TensorboardRun via os.path.relpath.
- Sanitization mirrors the SDK's reformat_run_name regex so the GCS subdir
  name and the SDK-derived run name agree (codex round-2 issue 5).
- from_env constructs the file writer first, then attempts to start the
  uploader; on failure the writer is closed before re-raising — no leaked
  uploader thread (codex round-2 issue 6).
- launch_single_pool_job now returns the submitted CustomJob so the smoke
  script can look up the per-job TensorboardExperiment (codex round-2
  issue 3).
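The run-naming scheme in the bullets above can be sketched like this. Both helper names are hypothetical, and the character class is an assumption standing in for the SDK's `reformat_run_name` regex that the branch mirrors:

```python
import os
import re
from datetime import datetime


def build_run_name(job_name: str, now: datetime) -> str:
    """Sanitized job name + UTC timestamp, unique per launch."""
    # Assumed mirror of the SDK-style run-name reformatting: keep [a-z0-9-].
    base = re.sub(r"[^a-z0-9-]", "-", job_name.lower())
    return f"{base}-{now.strftime('%Y%m%d%H%M%S')}"


def run_log_dir(aip_log_dir: str, run_name: str) -> str:
    """File-writer directory: a run-named subdir under AIP_TENSORBOARD_LOG_DIR.

    The uploader watches the parent, so its LogdirLoader discovers this
    subdir (via os.path.relpath) as a distinct TensorboardRun instead of
    merging everything into the "default" run.
    """
    return os.path.join(aip_log_dir, run_name)
```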
…ing entrypoints

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bypasses ConfigPopulator and the full pipeline, submitting a tiny n1-standard-2
CustomJob via the production launch_single_pool_job path so env-var injection
and submit-side wiring exercise the same code as a real trainer. Verifies the
TensorBoard story end-to-end via the API:
- Per-job auto-named TensorboardExperiment exists, has runs, runs have
  TensorboardTimeSeries (R1 + scalar ingestion).
- When --experiment-name is passed, the named experiment also exists with
  ≥1 run and ≥1 time series (R2).

--container-uri is required so the smoke loop tests branch-local code, not a
released image (codex round-2 issue 2). The launcher's launch_single_pool_job
now returns the CustomJob (codex round-2 issue 3) so the script can resolve
the auto-named experiment from the job's numeric ID.
Run via: python -m gigl.utils.dev.submit_smoke_job ...
After every CustomJob submission, log the per-job TensorboardExperiment URL
(name == job's numeric ID) and, when the user opted into a stable
TensorboardExperiment, also log the cross-job comparison URL. Saves users
having to construct the URL by hand from the project/region/TB-ID/experiment
template — particularly useful for the cross-job link, which the VAI job UI
does NOT surface (the "Open TensorBoard" button there only resolves to the
per-job experiment).
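The URL construction described above might look roughly like the sketch below. The URL shape is an assumption about the Vertex AI TensorBoard web UI and should be verified against the console before relying on it; the function name is hypothetical:

```python
def tensorboard_experiment_url(
    region: str, project_number: str, tensorboard_id: str, experiment: str
) -> str:
    """Assumed URL template for a Vertex AI TensorboardExperiment page."""
    return (
        f"https://{region}.tensorboard.googleusercontent.com/experiment/"
        f"projects+{project_number}+locations+{region}"
        f"+tensorboards+{tensorboard_id}+experiments+{experiment}"
    )
```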
The checked-in e2e CORA task config previously hard-coded
``tensorboardExperimentName: "kmonte-test-experiment"``. That value is
personal to one developer's debugging, AND any e2e/CI resource config
without ``tensorboard_resource_name`` would now fail validation because
the experiment-name check requires a backing TB resource (codex round-2
issue 4).

Leave the field unset in the example so the default e2e CORA test stays
compatible with all resource configs. Users opt into cross-job comparison
by setting ``tensorboardExperimentName`` in their own task config copy
alongside a TB resource on the trainer resource config.

Also folds in ``make format`` output across the branch (mostly
mdformat-driven markdown reflow).
The smoke launcher and tb_smoke_main were used to validate the
multi-run TB design end-to-end during this branch's development. Now
that R1 + R2 are validated and the production trainers carry the same
behavior, drop the dev tooling so it doesn't ship in the public package.
The plan at docs/plans/20260505-tb-multi-job-iteration.md retains a
record of the iteration loop for posterity.