Add comprehensive logging with IRIS_LOG_LEVEL support#522

Open
mawad-amd wants to merge 3 commits into main from muhaawad/iris-logging

Conversation

@mawad-amd
Collaborator

Summary

  • Add IRIS_LOG_LEVEL env var (DEBUG/INFO/WARNING/ERROR) to control iris log verbosity
  • Add _log_rank() helper for rank-aware logging with lazy formatting
  • Instrument 17 files across CCL ops, symmetric heap, allocators, distributed helpers, kernel launches, and fused ops with two-tier logging (DEBUG for tracing, INFO for lifecycle events)
  • Show [module] bracket only for internal iris logs, not user-facing ctx.info() calls

At the default INFO level, only 4-5 lifecycle lines appear during init; at DEBUG, every CCL call, kernel launch, allocation, and barrier is traced.

Test plan

  • ruff check iris/ && ruff format --check iris/
  • IRIS_LOG_LEVEL=DEBUG torchrun --nproc_per_node=4 tests/run_tests_distributed.py tests/ccl/test_all_reduce.py -v — verify debug output with [Iris] prefix
  • Default INFO level — no visible logging noise, no test regressions

🤖 Generated with Claude Code

mawad-amd and others added 3 commits April 26, 2026 22:23
Instrument 17 files with ~30 log lines covering CCL ops, symmetric heap,
allocators, distributed helpers, HIP platform, kernel launches, and fused
ops. Two-tier logging: DEBUG for entry-point tracing, INFO for lifecycle
events (init, peer access, IPC). Enhanced log format with timestamp,
level, module name, and rank. Add IRIS_LOG_LEVEL env var to control level
at import time.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_log_rank() now captures the caller's filename via sys._getframe so the
formatter can show [module]. User-facing ctx.info()/ctx.debug() calls go
through _log_with_rank(pathname="") so the module bracket is omitted.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The distributed_allgather, distributed_broadcast_tensor, and
distributed_barrier functions were calling _log_rank without rank=
and num_ranks= kwargs, causing log lines to show ?/? instead of
the actual rank info.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 27, 2026 05:24
@mawad-amd mawad-amd requested review from BKP and neoblizz as code owners April 27, 2026 05:24
@github-actions github-actions Bot added in-progress We are working on it iris Iris project issue labels Apr 27, 2026
Contributor

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds rank-aware, env-configurable logging across Iris distributed ops and runtime to improve debuggability while keeping default output low-noise.

Changes:

  • Introduces IRIS_LOG_LEVEL env override and _log_rank() helper for rank-aware internal logging.
  • Updates IrisFormatter to include timestamps/levels/rank and module information.
  • Instruments CCL ops, matmul collectives, allocators/symmetric heap, kernel launch tracing, and distributed helpers with DEBUG/INFO logs.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
iris/host/logging/logging.py Adds IRIS_LOG_LEVEL, expands formatter output, and introduces _log_rank() helper.
iris/host/tracing/kernel_artifacts.py Adds DEBUG tracing for kernel launches via _log_rank().
iris/host/platform/hip.py Logs HIP/CUDA errors and DMA-BUF export at ERROR/DEBUG.
iris/host/memory/symmetric_heap.py Adds INFO/DEBUG lifecycle tracing for heap init, allocations, and peer refresh.
iris/host/memory/allocators/vmem_allocator.py Adds INFO init and DEBUG allocation tracing.
iris/host/memory/allocators/torch_allocator.py Adds INFO init, DEBUG allocation tracing, and ERROR OOM log.
iris/host/iris.py Adds INFO init log and DEBUG barrier trace.
iris/host/distributed/helpers.py Adds DEBUG tracing for allgather/broadcast/barrier helpers.
iris/host/distributed/fd_passing.py Adds DEBUG tracing for FD infrastructure setup.
iris/ccl/all_reduce.py Adds DEBUG tracing for all-reduce preamble and main call.
iris/ccl/all_gather.py Adds DEBUG tracing for all-gather entry.
iris/ccl/all_to_all.py Adds DEBUG tracing for all-to-all entry.
iris/ccl/reduce_scatter.py Adds DEBUG tracing for reduce-scatter entry.
iris/ops/matmul_all_reduce.py Adds DEBUG tracing for matmul all-reduce entry.
iris/ops/matmul_all_gather.py Adds DEBUG tracing for matmul all-gather entry.
iris/ops/matmul_reduce_scatter.py Adds DEBUG tracing for matmul reduce-scatter entry.
iris/ops/all_gather_matmul.py Adds DEBUG tracing for fused all-gather matmul entry.

Comment on lines +26 to +33
rank = getattr(record, "iris_rank", "?")
num_ranks = getattr(record, "iris_num_ranks", "?")
ts = self.formatTime(record, "%H:%M:%S")
level = record.levelname
module = record.module
if module:
    return f"{ts} {level:<5s} [Iris] [{rank}/{num_ranks}] [{module}] {record.getMessage()}"
return f"{ts} {level:<5s} [Iris] [{rank}/{num_ranks}] {record.getMessage()}"

Copilot AI Apr 27, 2026

The formatter always includes the [{module}] bracket whenever record.module is present. This likely affects user-facing logs too (e.g., ctx.info()), leaking the user module/script name, which conflicts with the PR description (“Show [module] bracket only for internal iris logs, not user-facing ctx.info() calls”). A concrete fix is to conditionally include [{module}] only when a dedicated flag (e.g., record.iris_internal = True) is set by _log_rank(), and omit it for regular logger.info(...) calls.
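The reviewer's suggested fix could be sketched as follows (the `iris_internal` flag is the reviewer's proposal; the formatter body is illustrative):

```python
import logging


class IrisFormatter(logging.Formatter):
    """Render the [module] bracket only for records flagged as internal
    iris logs (iris_internal set by _log_rank), never for user-facing
    ctx.info() calls. Sketch of the review suggestion, not PR code."""

    def format(self, record):
        rank = getattr(record, "iris_rank", "?")
        num_ranks = getattr(record, "iris_num_ranks", "?")
        ts = self.formatTime(record, "%H:%M:%S")
        prefix = f"{ts} {record.levelname:<5s} [Iris] [{rank}/{num_ranks}]"
        if getattr(record, "iris_internal", False):
            return f"{prefix} [{record.module}] {record.getMessage()}"
        return f"{prefix} {record.getMessage()}"
```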

)
if rank is not None:
    record.iris_rank = rank
    record.iris_num_ranks = num_ranks

Copilot AI Apr 27, 2026


_log_rank() sets record.iris_num_ranks = num_ranks whenever rank is provided, even if num_ranks is None. Since IrisFormatter uses getattr(record, "iris_num_ranks", "?"), the presence of the attribute causes the output to render None instead of ? (e.g., calls like iris_launch(..., rank=rank) that don't pass num_ranks). Fix by only setting iris_num_ranks when num_ranks is not None, or normalizing it to "?"/-1 when missing.

Suggested change
record.iris_num_ranks = num_ranks
if num_ranks is not None:
    record.iris_num_ranks = num_ranks

Comment on lines +312 to +318
_log_rank(
    logging.INFO,
    "refresh_peer_access: done heap_bases=[%s]",
    ", ".join(f"0x{int(self.heap_bases[r].item()):x}" for r in range(min(3, self.num_ranks))),
    rank=self.cur_rank,
    num_ranks=self.num_ranks,
)

Copilot AI Apr 27, 2026


The third argument to _log_rank() is eagerly computed before the function call, which forces up to 3 .item() device-to-host synchronizations on self.heap_bases even when INFO logging is later disabled (and even when it’s enabled, it adds synchronization overhead). To keep logging truly low-overhead, avoid precomputing the joined string; instead guard the computation behind a level check (e.g., if logger.isEnabledFor(logging.INFO): ...) or pass cheaper-to-compute arguments (e.g., cached CPU copies) so .item() isn’t executed as part of argument evaluation.

Suggested change
_log_rank(
    logging.INFO,
    "refresh_peer_access: done heap_bases=[%s]",
    ", ".join(f"0x{int(self.heap_bases[r].item()):x}" for r in range(min(3, self.num_ranks))),
    rank=self.cur_rank,
    num_ranks=self.num_ranks,
)
if logging.getLogger(__name__).isEnabledFor(logging.INFO):
    heap_bases_preview = ", ".join(
        f"0x{int(self.heap_bases[r].item()):x}" for r in range(min(3, self.num_ranks))
    )
    _log_rank(
        logging.INFO,
        "refresh_peer_access: done heap_bases=[%s]",
        heap_bases_preview,
        rank=self.cur_rank,
        num_ranks=self.num_ranks,
    )

Comment on lines +192 to +199
_log_rank(
    logging.INFO,
    "refresh_peer_access: start rank=%d num_ranks=%d",
    self.cur_rank,
    self.num_ranks,
    rank=self.cur_rank,
    num_ranks=self.num_ranks,
)

Copilot AI Apr 27, 2026


refresh_peer_access() logs at INFO on every invocation (“start” and also “done” later). In this file, allocate() calls self.refresh_peer_access() for every allocation (see context excerpt), so default INFO mode will emit repeated lifecycle lines and contradict the PR description that default INFO produces only ~4–5 init lines. Consider lowering these to DEBUG, or logging INFO only on first initialization / when topology changes, to keep default output low-noise.
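One way to realize the "INFO only on first initialization" suggestion is a once-guard that demotes repeat invocations to DEBUG (the function name `log_refresh_lifecycle` and the module-level flag are hypothetical; this is a sketch, not the PR's code):

```python
import logging

logger = logging.getLogger("iris")
_refresh_logged_once = False


def log_refresh_lifecycle():
    """Emit the refresh_peer_access lifecycle line at INFO only the
    first time; later refreshes (e.g. one per allocation) drop to
    DEBUG so the default INFO output stays low-noise."""
    global _refresh_logged_once
    level = logging.DEBUG if _refresh_logged_once else logging.INFO
    _refresh_logged_once = True
    logger.log(level, "refresh_peer_access: start")
```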

Comment thread iris/host/iris.py
Comment on lines +108 to +111
self._log_with_rank(
    logging.INFO,
    f"init: heap_size={heap_size / (1 << 30):.1f}GB rank={cur_rank}/{num_ranks} allocator={allocator_type}",
)

Copilot AI Apr 27, 2026


This uses an f-string, which eagerly formats the message even if INFO logging is disabled (and the PR description mentions “lazy formatting”). Prefer passing a format string plus arguments (or extending _log_with_rank to accept msg, *args like _log_rank) so formatting work is skipped when the level isn’t enabled.

Suggested change
self._log_with_rank(
    logging.INFO,
    f"init: heap_size={heap_size / (1 << 30):.1f}GB rank={cur_rank}/{num_ranks} allocator={allocator_type}",
)
if logger.isEnabledFor(logging.INFO):
    self._log_with_rank(
        logging.INFO,
        f"init: heap_size={heap_size / (1 << 30):.1f}GB rank={cur_rank}/{num_ranks} allocator={allocator_type}",
    )

