Kmonte/add tb for glt #603
Draft
kmontemayor2-sc wants to merge 43 commits into main from
Conversation
Replace tests for the four-function API with tests for the new TensorBoardWriter class. Tests fail with ImportError until the class lands in the next commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code review found that test_close_on_noop_writer_does_not_raise only called close() once, so idempotency on the no-op path was untested. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collapse resolve_tensorboard_log_dir, create_tensorboard_writer, write_tensorboard_scalar, close_tensorboard_writer, and VERTEX_TENSORBOARD_LOG_DIR_ENV_KEY into a single TensorBoardWriter class with from_uri classmethod. The class is context-managerable and no-ops when disabled, eliminating Optional plumbing at call sites. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
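A minimal sketch of the class shape the commit describes. The names `TensorBoardWriter`, `from_uri`, and the no-op-when-disabled behavior come from the commit messages; everything else (the `log` signature, the in-memory event list standing in for a real `SummaryWriter` backend) is an illustrative assumption, not the actual implementation.

```python
from __future__ import annotations

from typing import Mapping, Optional


class TensorBoardWriter:
    """Illustrative sketch: one object replacing the four-function API.

    A real implementation would wrap something like
    torch.utils.tensorboard.SummaryWriter; here an in-memory list stands
    in for event files so the sketch is self-contained.
    """

    def __init__(self, log_dir: Optional[str]) -> None:
        # A None log_dir means disabled: every method becomes a no-op.
        self._log_dir = log_dir
        self.events: list[tuple[str, float, int]] = []  # stand-in for event files

    @classmethod
    def from_uri(cls, uri: Optional[str]) -> "TensorBoardWriter":
        # A None URI yields a no-op writer, so call sites need no
        # Optional plumbing or boolean gating.
        return cls(log_dir=uri)

    def log(self, scalars: Mapping[str, float], step: int) -> None:
        # Dict-style logging: one call replaces back-to-back scalar writes.
        if self._log_dir is None:
            return  # disabled: silently do nothing
        for tag, value in scalars.items():
            self.events.append((tag, value, step))

    def close(self) -> None:
        # Safe to call repeatedly, including on the no-op path.
        pass

    def __enter__(self) -> "TensorBoardWriter":
        return self

    def __exit__(self, *exc) -> None:
        self.close()


# Usage: the same call works whether or not logging is enabled.
with TensorBoardWriter.from_uri(None) as writer:
    writer.log({"loss": 0.5, "acc": 0.9}, step=1)  # no-op when URI is None
```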
The V1 BaseTrainer.train body never wrote scalars via TF's ambient default writer, so the file_writer + as_default() block in __run_training served no purpose. Verified with grep: no tf.summary.scalar or write_tensorboard_scalar callers anywhere in gigl/src/training/v1/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace function-based tensorboard helpers with the new TensorBoardWriter class. Collapses two back-to-back scalar writes per log step into a single dict-style log() call. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…riter Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nsorBoardWriter Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…TensorBoardWriter Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The default for the proto bool field is false, so removing the line preserves the same behavior. Avoids implying that tensorboard logging should be on by default in example task configs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Relocate from gigl/src/common/utils/tensorboard.py (internal pipeline utilities) to gigl/utils/tensorboard_writer.py (general-purpose user utilities). Update the test file path, the in-test patch paths, and all four example training scripts to import from the new location. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Examples now log to TensorBoard whenever a tensorboard_log_uri is set on the gbml config; no separate boolean to flip. Removes the should_log_to_tensorboard field from TrainingProcessArgs (and its docstring entry, extraction from the gbml config, and pass-through to _training_process) in all four link-prediction example training scripts. TensorBoardWriter.from_uri already returns a no-op writer when the URI is None, so a single signal (the URI) is sufficient. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
V1 and V2 trainer launchers previously gated tensorboard_logs_uri construction on both should_log_to_tensorboard AND a non-empty proto field. Drop the should_log_to_tensorboard half so the URI is forwarded to the Vertex AI launcher whenever the proto field is set. One signal (URI presence) instead of two. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TensorBoardWriter now reads AIP_TENSORBOARD_LOG_DIR directly and raises when unset. Vertex AI populates that env var from CustomJobSpec.baseOutputDirectory, which GiGL's launcher already derives from tensorboardLogsUri — so the prior fallback in _resolve_log_dir was dead weight. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ning Now that TensorBoardWriter reads AIP_TENSORBOARD_LOG_DIR directly via from_env(), the example no longer needs to plumb tensorboard_logs_uri from the GbmlConfig through TrainingProcessArgs to the worker process. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Now that TensorBoardWriter reads AIP_TENSORBOARD_LOG_DIR directly via from_env(), the examples no longer need to plumb tensorboard_logs_uri from the GbmlConfig through TrainingProcessArgs to the worker process. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…_name Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… set Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…oard_logs_uri" This reverts commit 99ab56d.
…t_name
Vertex AI MetadataStore Context IDs (which back ExperimentRuns) must match
[a-z0-9][a-z0-9-]{0,127}. GiGL's gigl_train_<task>_<timestamp> job names
contain underscores and were rejected with a 400 from the SDK. Lowercase and
replace underscores with hyphens for the run ID; fail fast with a clear
message when the user-supplied tensorboard_experiment_name itself is invalid
(don't silently rewrite user input).
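The sanitize-vs-validate split above can be sketched as follows. The regex is the one quoted in the commit; the helper names `sanitize_run_id` and `validate_experiment_name` are hypothetical, chosen for illustration.

```python
import re

# Vertex AI MetadataStore Context-ID pattern quoted in the commit message.
_CONTEXT_ID_RE = re.compile(r"[a-z0-9][a-z0-9-]{0,127}")


def sanitize_run_id(job_name: str) -> str:
    """Derive a valid run ID from a GiGL job name (illustrative helper).

    gigl_train_<task>_<timestamp> names contain underscores, which Vertex
    rejects with a 400; lowercase and hyphenate, then verify the result.
    """
    run_id = job_name.lower().replace("_", "-")
    if not _CONTEXT_ID_RE.fullmatch(run_id):
        raise ValueError(f"Cannot derive a valid run ID from job name {job_name!r}")
    return run_id


def validate_experiment_name(name: str) -> str:
    """User input is validated, never silently rewritten: fail fast."""
    if not _CONTEXT_ID_RE.fullmatch(name):
        raise ValueError(
            f"tensorboard_experiment_name {name!r} must match "
            "[a-z0-9][a-z0-9-]{0,127}"
        )
    return name
```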
Passing experiment_run=<string> invokes aiplatform.ExperimentRun's getter, which 404s when the run doesn't exist yet (every first submit does). Per the SDK's own contract — "If 'experiment' is set but 'experiment_run' is not, an ExperimentRun resource will still be auto-generated" (jobs.py:2509-2514) — the right move is to set only experiment= and let the SDK create the run. The user-supplied experiment name is still validated up front against Vertex's resource-ID regex; the run-id sanitizer is no longer needed.
…path Vertex's auto-uploader is gated on submit(tensorboard=); submit(experiment=) alone leaves AIP_TENSORBOARD_LOG_DIR populated but never reads from it, so events sit in GCS un-uploaded (job 6570151780682825728 is the smoking gun). The two submit kwargs are mutually exclusive, and tensorboard= forces a job-scoped experiment we can't rename; neither delivers multi-run comparison.

Switch approach: when tensorboard_experiment_name is set, the launcher injects GIGL_TENSORBOARD_RESOURCE_NAME and GIGL_TENSORBOARD_EXPERIMENT_NAME into the worker container and submits with NEITHER experiment= nor tensorboard=. The trainer's TensorBoardWriter.from_env reads those env vars on the chief rank and starts aiplatform.start_upload_tb_log to stream AIP_TENSORBOARD_LOG_DIR to the named TensorboardExperiment; close() pairs with end_upload_tb_log. Multiple jobs sharing the experiment name show as comparable runs on one TB page.

Drops the now-unused _ensure_experiment_with_backing_tb helper (the SDK uploader auto-creates the TensorboardExperiment).
The "Open TensorBoard" link on the Vertex AI job page is gated on jobSpec.tensorboard. Dropping that field (which the previous experiment-name branch did) hides the link entirely. Make _submit_job set tensorboard= any time the resource name is configured. The chief-rank uploader continues to stream to the user-named TensorboardExperiment in parallel for cross-job comparison. Refresh the now-stale dual-uploader story in the dataclass docstring, launcher comment block, and proto comment. Also adds the round-2 plan to docs/plans/.
…runs The chief-rank aiplatform.start_upload_tb_log uploader was watching AIP_TENSORBOARD_LOG_DIR with no run_name_prefix. The SDK's LogdirLoader maps root-logdir events to DEFAULT_RUN_NAME = "default", and the SDK silently merges into existing runs by name, so two jobs sharing the same TensorboardExperiment would collapse into one "default" run instead of producing two comparable runs (codex round-2 issue 1). Approach:

- The launcher injects GIGL_TENSORBOARD_RUN_NAME = sanitized(job_name) + UTC timestamp. The timestamp keeps the run name unique per launch, even across reruns of the same applied_task_identifier.
- The writer uses <AIP_TENSORBOARD_LOG_DIR>/<run_name>/ as the file-writer directory; the uploader still watches the parent, so the SDK's LogdirLoader discovers the subdir as a TensorboardRun via os.path.relpath.
- Sanitization mirrors the SDK's reformat_run_name regex so the GCS subdir name and the SDK-derived run name agree (codex round-2 issue 5).
- from_env constructs the file writer first, then attempts to start the uploader; on failure, the writer is closed before re-raising, so no uploader thread is leaked (codex round-2 issue 6).
- launch_single_pool_job now returns the submitted CustomJob so the smoke script can look up the per-job TensorboardExperiment (codex round-2 issue 3).
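The run-name and subdirectory scheme above can be sketched as below. `GIGL_TENSORBOARD_RUN_NAME` and the parent/subdir layout come from the commit; the helper names, the exact sanitization regex (the real code mirrors the SDK's `reformat_run_name`), and the timestamp format are assumptions.

```python
import datetime
import os
import re
from typing import Optional

# Illustrative stand-in for the SDK's run-name character set; the real
# code mirrors aiplatform's reformat_run_name regex, which may differ.
_INVALID_CHARS = re.compile(r"[^a-z0-9-]")


def make_run_name(job_name: str, now: Optional[datetime.datetime] = None) -> str:
    """Launch-unique run name: sanitized job name plus a UTC timestamp.

    The timestamp keeps the name unique even across reruns of the same
    applied_task_identifier.
    """
    now = now or datetime.datetime.now(datetime.timezone.utc)
    base = _INVALID_CHARS.sub("-", job_name.lower())
    return f"{base}-{now:%Y%m%d%H%M%S}"


def run_log_dir(aip_log_dir: str, run_name: str) -> str:
    """Per-run file-writer directory under AIP_TENSORBOARD_LOG_DIR.

    The uploader watches the parent, so the SDK's LogdirLoader discovers
    this subdir as a distinct TensorboardRun instead of merging everything
    into the "default" run.
    """
    return os.path.join(aip_log_dir, run_name)
```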
…ing entrypoints Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bypasses ConfigPopulator and the full pipeline, submitting a tiny n1-standard-2 CustomJob via the production launch_single_pool_job path so env-var injection and submit-side wiring exercise the same code as a real trainer. Verifies the TensorBoard story end-to-end via the API:

- The per-job auto-named TensorboardExperiment exists, has runs, and the runs have TensorboardTimeSeries (R1 + scalar ingestion).
- When --experiment-name is passed, the named experiment also exists with ≥1 run and ≥1 time series (R2).

--container-uri is required so the smoke loop tests branch-local code, not a released image (codex round-2 issue 2). The launcher's launch_single_pool_job now returns the CustomJob (codex round-2 issue 3) so the script can resolve the auto-named experiment from the job's numeric ID.
Run via: python -m gigl.utils.dev.submit_smoke_job ...
After every CustomJob submission, log the per-job TensorboardExperiment URL (name == job's numeric ID) and, when the user opted into a stable TensorboardExperiment, also log the cross-job comparison URL. Saves users having to construct the URL by hand from the project/region/TB-ID/experiment template — particularly useful for the cross-job link, which the VAI job UI does NOT surface (the "Open TensorBoard" button there only resolves to the per-job experiment).
The checked-in e2e CORA task config previously hard-coded ``tensorboardExperimentName: "kmonte-test-experiment"``. That value is personal to one developer's debugging, AND any e2e/CI resource config without ``tensorboard_resource_name`` would now fail validation because the experiment-name check requires a backing TB resource (codex round-2 issue 4). Leave the field unset in the example so the default e2e CORA test stays compatible with all resource configs. Users opt into cross-job comparison by setting ``tensorboardExperimentName`` in their own task config copy alongside a TB resource on the trainer resource config. Also folds in ``make format`` output across the branch (mostly mdformat-driven markdown reflow).
The smoke launcher and tb_smoke_main were used to validate the multi-run TB design end-to-end during this branch's development. Now that R1 + R2 are validated and the production trainers carry the same behavior, drop the dev tooling so it doesn't ship in the public package. The plan at docs/plans/20260505-tb-multi-job-iteration.md retains a record of the iteration loop for posterity.
Scope of work done
Where is the documentation for this feature?: N/A
Did you add automated tests or write a test plan?
Updated Changelog.md? NO
Ready for code review?: NO