Add comprehensive logging with IRIS_LOG_LEVEL support#522
Add comprehensive logging with IRIS_LOG_LEVEL support#522
Conversation
Instrument 17 files with ~30 log lines covering CCL ops, symmetric heap, allocators, distributed helpers, HIP platform, kernel launches, and fused ops. Two-tier logging: DEBUG for entry-point tracing, INFO for lifecycle events (init, peer access, IPC). Enhanced log format with timestamp, level, module name, and rank. Add IRIS_LOG_LEVEL env var to control level at import time. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_log_rank() now captures the caller's filename via sys._getframe so the formatter can show [module]. User-facing ctx.info()/ctx.debug() calls go through _log_with_rank(pathname="") so the module bracket is omitted. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The distributed_allgather, distributed_broadcast_tensor, and distributed_barrier functions were calling _log_rank without rank= and num_ranks= kwargs, causing log lines to show ?/? instead of the actual rank info. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds rank-aware, env-configurable logging across Iris distributed ops and runtime to improve debuggability while keeping default output low-noise.
Changes:
- Introduces
IRIS_LOG_LEVELenv override and_log_rank()helper for rank-aware internal logging. - Updates
IrisFormatterto include timestamps/levels/rank and module information. - Instruments CCL ops, matmul collectives, allocators/symmetric heap, kernel launch tracing, and distributed helpers with DEBUG/INFO logs.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| iris/host/logging/logging.py | Adds IRIS_LOG_LEVEL, expands formatter output, and introduces _log_rank() helper. |
| iris/host/tracing/kernel_artifacts.py | Adds DEBUG tracing for kernel launches via _log_rank(). |
| iris/host/platform/hip.py | Logs HIP/CUDA errors and DMA-BUF export at ERROR/DEBUG. |
| iris/host/memory/symmetric_heap.py | Adds INFO/DEBUG lifecycle tracing for heap init, allocations, and peer refresh. |
| iris/host/memory/allocators/vmem_allocator.py | Adds INFO init and DEBUG allocation tracing. |
| iris/host/memory/allocators/torch_allocator.py | Adds INFO init, DEBUG allocation tracing, and ERROR OOM log. |
| iris/host/iris.py | Adds INFO init log and DEBUG barrier trace. |
| iris/host/distributed/helpers.py | Adds DEBUG tracing for allgather/broadcast/barrier helpers. |
| iris/host/distributed/fd_passing.py | Adds DEBUG tracing for FD infrastructure setup. |
| iris/ccl/all_reduce.py | Adds DEBUG tracing for all-reduce preamble and main call. |
| iris/ccl/all_gather.py | Adds DEBUG tracing for all-gather entry. |
| iris/ccl/all_to_all.py | Adds DEBUG tracing for all-to-all entry. |
| iris/ccl/reduce_scatter.py | Adds DEBUG tracing for reduce-scatter entry. |
| iris/ops/matmul_all_reduce.py | Adds DEBUG tracing for matmul all-reduce entry. |
| iris/ops/matmul_all_gather.py | Adds DEBUG tracing for matmul all-gather entry. |
| iris/ops/matmul_reduce_scatter.py | Adds DEBUG tracing for matmul reduce-scatter entry. |
| iris/ops/all_gather_matmul.py | Adds DEBUG tracing for fused all-gather matmul entry. |
| rank = getattr(record, "iris_rank", "?") | ||
| num_ranks = getattr(record, "iris_num_ranks", "?") | ||
| ts = self.formatTime(record, "%H:%M:%S") | ||
| level = record.levelname | ||
| module = record.module | ||
| if module: | ||
| return f"{ts} {level:<5s} [Iris] [{rank}/{num_ranks}] [{module}] {record.getMessage()}" | ||
| return f"{ts} {level:<5s} [Iris] [{rank}/{num_ranks}] {record.getMessage()}" |
There was a problem hiding this comment.
The formatter always includes the [{module}] bracket whenever record.module is present. This likely affects user-facing logs too (e.g., ctx.info()), leaking the user module/script name, which conflicts with the PR description (“Show [module] bracket only for internal iris logs, not user-facing ctx.info() calls”). A concrete fix is to conditionally include [{module}] only when a dedicated flag (e.g., record.iris_internal = True) is set by _log_rank(), and omit it for regular logger.info(...) calls.
| ) | ||
| if rank is not None: | ||
| record.iris_rank = rank | ||
| record.iris_num_ranks = num_ranks |
There was a problem hiding this comment.
_log_rank() sets record.iris_num_ranks = num_ranks whenever rank is provided, even if num_ranks is None. Since IrisFormatter uses getattr(record, \"iris_num_ranks\", \"?\"), the presence of the attribute causes the output to render None instead of ? (e.g., calls like iris_launch(..., rank=rank) that don’t pass num_ranks). Fix by only setting iris_num_ranks when num_ranks is not None, or normalizing it to \"?\"/-1 when missing.
| record.iris_num_ranks = num_ranks | |
| if num_ranks is not None: | |
| record.iris_num_ranks = num_ranks |
| _log_rank( | ||
| logging.INFO, | ||
| "refresh_peer_access: done heap_bases=[%s]", | ||
| ", ".join(f"0x{int(self.heap_bases[r].item()):x}" for r in range(min(3, self.num_ranks))), | ||
| rank=self.cur_rank, | ||
| num_ranks=self.num_ranks, | ||
| ) |
There was a problem hiding this comment.
The third argument to _log_rank() is eagerly computed before the function call, which forces up to 3 .item() device-to-host synchronizations on self.heap_bases even when INFO logging is later disabled (and even when it’s enabled, it adds synchronization overhead). To keep logging truly low-overhead, avoid precomputing the joined string; instead guard the computation behind a level check (e.g., if logger.isEnabledFor(logging.INFO): ...) or pass cheaper-to-compute arguments (e.g., cached CPU copies) so .item() isn’t executed as part of argument evaluation.
| _log_rank( | |
| logging.INFO, | |
| "refresh_peer_access: done heap_bases=[%s]", | |
| ", ".join(f"0x{int(self.heap_bases[r].item()):x}" for r in range(min(3, self.num_ranks))), | |
| rank=self.cur_rank, | |
| num_ranks=self.num_ranks, | |
| ) | |
| if logging.getLogger(__name__).isEnabledFor(logging.INFO): | |
| heap_bases_preview = ", ".join( | |
| f"0x{int(self.heap_bases[r].item()):x}" for r in range(min(3, self.num_ranks)) | |
| ) | |
| _log_rank( | |
| logging.INFO, | |
| "refresh_peer_access: done heap_bases=[%s]", | |
| heap_bases_preview, | |
| rank=self.cur_rank, | |
| num_ranks=self.num_ranks, | |
| ) |
| _log_rank( | ||
| logging.INFO, | ||
| "refresh_peer_access: start rank=%d num_ranks=%d", | ||
| self.cur_rank, | ||
| self.num_ranks, | ||
| rank=self.cur_rank, | ||
| num_ranks=self.num_ranks, | ||
| ) |
There was a problem hiding this comment.
refresh_peer_access() logs at INFO on every invocation (“start” and also “done” later). In this file, allocate() calls self.refresh_peer_access() for every allocation (see context excerpt), so default INFO mode will emit repeated lifecycle lines and contradict the PR description that default INFO produces only ~4–5 init lines. Consider lowering these to DEBUG, or logging INFO only on first initialization / when topology changes, to keep default output low-noise.
| self._log_with_rank( | ||
| logging.INFO, | ||
| f"init: heap_size={heap_size / (1 << 30):.1f}GB rank={cur_rank}/{num_ranks} allocator={allocator_type}", | ||
| ) |
There was a problem hiding this comment.
This uses an f-string, which eagerly formats the message even if INFO logging is disabled (and the PR description mentions “lazy formatting”). Prefer passing a format string plus arguments (or extending _log_with_rank to accept msg, *args like _log_rank) so formatting work is skipped when the level isn’t enabled.
| self._log_with_rank( | |
| logging.INFO, | |
| f"init: heap_size={heap_size / (1 << 30):.1f}GB rank={cur_rank}/{num_ranks} allocator={allocator_type}", | |
| ) | |
| if logger.isEnabledFor(logging.INFO): | |
| self._log_with_rank( | |
| logging.INFO, | |
| f"init: heap_size={heap_size / (1 << 30):.1f}GB rank={cur_rank}/{num_ranks} allocator={allocator_type}", | |
| ) |
Summary
IRIS_LOG_LEVELenv var (DEBUG/INFO/WARNING/ERROR) to control iris log verbosity_log_rank()helper for rank-aware logging with lazy formatting[module]bracket only for internal iris logs, not user-facingctx.info()callsAt default INFO level, only 4-5 lifecycle lines appear during init. At DEBUG, full tracing of every CCL call, kernel launch, allocation, and barrier.
Test plan
ruff check iris/ && ruff format --check iris/IRIS_LOG_LEVEL=DEBUG torchrun --nproc_per_node=4 tests/run_tests_distributed.py tests/ccl/test_all_reduce.py -v— verify debug output with[Iris]prefix🤖 Generated with Claude Code