Add comprehensive logging with IRIS_LOG_LEVEL support by mawad-amd · Pull Request #522 · ROCm/iris

mawad-amd · 2026-04-27T05:24:14Z

Summary

Add IRIS_LOG_LEVEL env var (DEBUG/INFO/WARNING/ERROR) to control iris log verbosity
Add _log_rank() helper for rank-aware logging with lazy formatting
Instrument 17 files across CCL ops, symmetric heap, allocators, distributed helpers, kernel launches, and fused ops with two-tier logging (DEBUG for tracing, INFO for lifecycle events)
Show [module] bracket only for internal iris logs, not user-facing ctx.info() calls

At default INFO level, only 4-5 lifecycle lines appear during init. At DEBUG, full tracing of every CCL call, kernel launch, allocation, and barrier.

Test plan

ruff check iris/ && ruff format --check iris/
IRIS_LOG_LEVEL=DEBUG torchrun --nproc_per_node=4 tests/run_tests_distributed.py tests/ccl/test_all_reduce.py -v — verify debug output with [Iris] prefix
Default INFO level — no visible logging noise, no test regressions

🤖 Generated with Claude Code

Instrument 17 files with ~30 log lines covering CCL ops, symmetric heap, allocators, distributed helpers, HIP platform, kernel launches, and fused ops. Two-tier logging: DEBUG for entry-point tracing, INFO for lifecycle events (init, peer access, IPC). Enhanced log format with timestamp, level, module name, and rank. Add IRIS_LOG_LEVEL env var to control level at import time. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

_log_rank() now captures the caller's filename via sys._getframe so the formatter can show [module]. User-facing ctx.info()/ctx.debug() calls go through _log_with_rank(pathname="") so the module bracket is omitted. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The distributed_allgather, distributed_broadcast_tensor, and distributed_barrier functions were calling _log_rank without rank= and num_ranks= kwargs, causing log lines to show ?/? instead of the actual rank info. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds rank-aware, env-configurable logging across Iris distributed ops and runtime to improve debuggability while keeping default output low-noise.

Changes:

Introduces IRIS_LOG_LEVEL env override and _log_rank() helper for rank-aware internal logging.
Updates IrisFormatter to include timestamps/levels/rank and module information.
Instruments CCL ops, matmul collectives, allocators/symmetric heap, kernel launch tracing, and distributed helpers with DEBUG/INFO logs.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
iris/host/logging/logging.py	Adds `IRIS_LOG_LEVEL`, expands formatter output, and introduces `_log_rank()` helper.
iris/host/tracing/kernel_artifacts.py	Adds DEBUG tracing for kernel launches via `_log_rank()`.
iris/host/platform/hip.py	Logs HIP/CUDA errors and DMA-BUF export at ERROR/DEBUG.
iris/host/memory/symmetric_heap.py	Adds INFO/DEBUG lifecycle tracing for heap init, allocations, and peer refresh.
iris/host/memory/allocators/vmem_allocator.py	Adds INFO init and DEBUG allocation tracing.
iris/host/memory/allocators/torch_allocator.py	Adds INFO init, DEBUG allocation tracing, and ERROR OOM log.
iris/host/iris.py	Adds INFO init log and DEBUG barrier trace.
iris/host/distributed/helpers.py	Adds DEBUG tracing for allgather/broadcast/barrier helpers.
iris/host/distributed/fd_passing.py	Adds DEBUG tracing for FD infrastructure setup.
iris/ccl/all_reduce.py	Adds DEBUG tracing for all-reduce preamble and main call.
iris/ccl/all_gather.py	Adds DEBUG tracing for all-gather entry.
iris/ccl/all_to_all.py	Adds DEBUG tracing for all-to-all entry.
iris/ccl/reduce_scatter.py	Adds DEBUG tracing for reduce-scatter entry.
iris/ops/matmul_all_reduce.py	Adds DEBUG tracing for matmul all-reduce entry.
iris/ops/matmul_all_gather.py	Adds DEBUG tracing for matmul all-gather entry.
iris/ops/matmul_reduce_scatter.py	Adds DEBUG tracing for matmul reduce-scatter entry.
iris/ops/all_gather_matmul.py	Adds DEBUG tracing for fused all-gather matmul entry.

Copilot · 2026-04-27T05:31:39Z

+        rank = getattr(record, "iris_rank", "?")
+        num_ranks = getattr(record, "iris_num_ranks", "?")
+        ts = self.formatTime(record, "%H:%M:%S")
+        level = record.levelname
+        module = record.module
+        if module:
+            return f"{ts} {level:<5s} [Iris] [{rank}/{num_ranks}] [{module}] {record.getMessage()}"
+        return f"{ts} {level:<5s} [Iris] [{rank}/{num_ranks}] {record.getMessage()}"


The formatter always includes the [{module}] bracket whenever record.module is present. This likely affects user-facing logs too (e.g., ctx.info()), leaking the user module/script name, which conflicts with the PR description (“Show [module] bracket only for internal iris logs, not user-facing ctx.info() calls”). A concrete fix is to conditionally include [{module}] only when a dedicated flag (e.g., record.iris_internal = True) is set by _log_rank(), and omit it for regular logger.info(...) calls.

Copilot · 2026-04-27T05:31:39Z

+        )
+        if rank is not None:
+            record.iris_rank = rank
+            record.iris_num_ranks = num_ranks


_log_rank() sets record.iris_num_ranks = num_ranks whenever rank is provided, even if num_ranks is None. Since IrisFormatter uses getattr(record, \"iris_num_ranks\", \"?\"), the presence of the attribute causes the output to render None instead of ? (e.g., calls like iris_launch(..., rank=rank) that don’t pass num_ranks). Fix by only setting iris_num_ranks when num_ranks is not None, or normalizing it to \"?\"/-1 when missing.

Suggested change

record.iris_num_ranks = num_ranks

if num_ranks is not None:

record.iris_num_ranks = num_ranks

Copilot · 2026-04-27T05:31:40Z

+        _log_rank(
+            logging.INFO,
+            "refresh_peer_access: done heap_bases=[%s]",
+            ", ".join(f"0x{int(self.heap_bases[r].item()):x}" for r in range(min(3, self.num_ranks))),
+            rank=self.cur_rank,
+            num_ranks=self.num_ranks,
+        )


The third argument to _log_rank() is eagerly computed before the function call, which forces up to 3 .item() device-to-host synchronizations on self.heap_bases even when INFO logging is later disabled (and even when it’s enabled, it adds synchronization overhead). To keep logging truly low-overhead, avoid precomputing the joined string; instead guard the computation behind a level check (e.g., if logger.isEnabledFor(logging.INFO): ...) or pass cheaper-to-compute arguments (e.g., cached CPU copies) so .item() isn’t executed as part of argument evaluation.

Suggested change

_log_rank(

logging.INFO,

"refresh_peer_access: done heap_bases=[%s]",

", ".join(f"0x{int(self.heap_bases[r].item()):x}" for r in range(min(3, self.num_ranks))),

rank=self.cur_rank,

num_ranks=self.num_ranks,

)

if logging.getLogger(__name__).isEnabledFor(logging.INFO):

heap_bases_preview = ", ".join(

f"0x{int(self.heap_bases[r].item()):x}" for r in range(min(3, self.num_ranks))

)

_log_rank(

logging.INFO,

"refresh_peer_access: done heap_bases=[%s]",

heap_bases_preview,

rank=self.cur_rank,

num_ranks=self.num_ranks,

)

Copilot · 2026-04-27T05:31:40Z

+        _log_rank(
+            logging.INFO,
+            "refresh_peer_access: start rank=%d num_ranks=%d",
+            self.cur_rank,
+            self.num_ranks,
+            rank=self.cur_rank,
+            num_ranks=self.num_ranks,
+        )


refresh_peer_access() logs at INFO on every invocation (“start” and also “done” later). In this file, allocate() calls self.refresh_peer_access() for every allocation (see context excerpt), so default INFO mode will emit repeated lifecycle lines and contradict the PR description that default INFO produces only ~4–5 init lines. Consider lowering these to DEBUG, or logging INFO only on first initialization / when topology changes, to keep default output low-noise.

Copilot · 2026-04-27T05:31:40Z

+        self._log_with_rank(
+            logging.INFO,
+            f"init: heap_size={heap_size / (1 << 30):.1f}GB rank={cur_rank}/{num_ranks} allocator={allocator_type}",
+        )


This uses an f-string, which eagerly formats the message even if INFO logging is disabled (and the PR description mentions “lazy formatting”). Prefer passing a format string plus arguments (or extending _log_with_rank to accept msg, *args like _log_rank) so formatting work is skipped when the level isn’t enabled.

Suggested change

self._log_with_rank(

logging.INFO,

f"init: heap_size={heap_size / (1 << 30):.1f}GB rank={cur_rank}/{num_ranks} allocator={allocator_type}",

)

if logger.isEnabledFor(logging.INFO):

self._log_with_rank(

logging.INFO,

f"init: heap_size={heap_size / (1 << 30):.1f}GB rank={cur_rank}/{num_ranks} allocator={allocator_type}",

)

mawad-amd and others added 3 commits April 26, 2026 22:23

Copilot AI review requested due to automatic review settings April 27, 2026 05:24

mawad-amd requested review from BKP and neoblizz as code owners April 27, 2026 05:24

github-actions Bot added in-progress We are working on it iris Iris project issue labels Apr 27, 2026

Copilot AI reviewed Apr 27, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add comprehensive logging with IRIS_LOG_LEVEL support#522

Add comprehensive logging with IRIS_LOG_LEVEL support#522
mawad-amd wants to merge 3 commits intomainfrom
muhaawad/iris-logging

mawad-amd commented Apr 27, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 27, 2026

Uh oh!

Copilot AI Apr 27, 2026

Uh oh!

Copilot AI Apr 27, 2026

Uh oh!

Copilot AI Apr 27, 2026

Uh oh!

Copilot AI Apr 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	record.iris_num_ranks = num_ranks
	if num_ranks is not None:
	record.iris_num_ranks = num_ranks

Conversation

mawad-amd commented Apr 27, 2026

Summary

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants