Add lightweight TGS and MFU logging for training by WendaDeng · Pull Request #9465 · modelscope/ms-swift

WendaDeng · 2026-06-01T10:58:34Z

PR type

Bug Fix
[ * ] New Feature
Document Updates
More Models or Datasets Support

PR information

HF swift sft:

without DEVICE_TFLOPS: tgs is printed, MFU is not printed
with DEVICE_TFLOPS=xxx: tgs and MFU are printed when state.total_flos > 0

Megatron sft:

without DEVICE_TFLOPS: tgs is printed, MFU is not printed
with DEVICE_TFLOPS=xxx: tgs, throughput and MFU are printed

PR explanation:

Although PerfMetricsLogCallback already provides MFU logging, this change intentionally keeps the lightweight progress-log path separate. The default progress callback is responsible for writing the console progress output and logging.jsonl; user callbacks such as perf_log may run after the progress log has already been emitted, so adding MFU there does not reliably make the metric appear in the same training log entry.

This implementation only enables MFU when DEVICE_TFLOPS is explicitly provided, avoiding the extra device benchmark in perf_log.py. It also keeps the change limited to log formatting/callback code and does not patch the training step or FLOPs calculation logic.

gemini-code-assist

Code Review

This pull request introduces tracking and logging for Tokens per GPU per Second (TGS), throughput (TFLOP/s/GPU), and Model Flops Utilization (MFU) across Megatron and Hugging Face trainers, utilizing a user-specified DEVICE_TFLOPS environment variable. The code review feedback highlights several critical robustness issues where missing safety checks on potentially None or 0 values for world_size, seq_length, and logs could lead to TypeError or ZeroDivisionError crashes during training. Actionable code suggestions were provided to defensively handle these cases and prevent runtime failures.

gemini-code-assist · 2026-06-01T11:00:16Z

    train_speed = elapsed / n_steps if n_steps > 0 else 0.0
+    seq_length = getattr(args, 'seq_length', None) if args is not None else None
+    if train_speed > 0 and seq_length:
+        world_size = max(getattr(args, 'world_size', 1), 1)


If args.world_size is explicitly set to None, getattr(args, 'world_size', 1) will return None. Passing None to max() along with an integer raises a TypeError in Python 3 (e.g., TypeError: '>' not supported between instances of 'int' and 'NoneType'), which would crash the training process. Using a fallback to 1 before calling max avoids this issue.

Suggested change

world_size = max(getattr(args, 'world_size', 1), 1)

world_size = max(getattr(args, 'world_size', None) or 1, 1)

gemini-code-assist · 2026-06-01T11:00:16Z

+    def on_init_end(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
+        tflops = get_env_args('DEVICE_TFLOPS', float, None)
+        if tflops is not None:
+            self.device_tflops = tflops * max(getattr(args, 'world_size', 1), 1)


If args.world_size is None, max(getattr(args, 'world_size', 1), 1) will raise a TypeError and crash the training. Using a safe fallback to 1 prevents this crash.

Suggested change

self.device_tflops = tflops * max(getattr(args, 'world_size', 1), 1)

self.device_tflops = tflops * max(getattr(args, 'world_size', None) or 1, 1)

gemini-code-assist · 2026-06-01T11:00:16Z

+        if train_speed > 0:
+            logs['tgs'] = round(
+                args.global_batch_size * args.seq_length / train_speed / args.world_size, 3)
+            if self.device_peak_tflops:
+                throughput = self._get_throughput_tflops_per_gpu(train_speed)
+                if throughput is not None:
+                    logs['throughput(TFLOP/s/GPU)'] = round(throughput, 3)
+                    logs['MFU'] = round(throughput / self.device_peak_tflops, 6)


If args.seq_length is None or args.world_size is None or 0, this calculation will raise a TypeError or ZeroDivisionError, crashing the training run. We should defensively check that seq_length is present and that world_size is a valid non-zero integer before performing the division.

Suggested change

if train_speed > 0:

logs['tgs'] = round(

args.global_batch_size * args.seq_length / train_speed / args.world_size, 3)

if self.device_peak_tflops:

throughput = self._get_throughput_tflops_per_gpu(train_speed)

if throughput is not None:

logs['throughput(TFLOP/s/GPU)'] = round(throughput, 3)

logs['MFU'] = round(throughput / self.device_peak_tflops, 6)

seq_length = getattr(args, 'seq_length', None)

world_size = getattr(args, 'world_size', None) or 1

if train_speed > 0 and seq_length:

logs['tgs'] = round(

args.global_batch_size * seq_length / train_speed / world_size, 3)

if self.device_peak_tflops:

throughput = self._get_throughput_tflops_per_gpu(train_speed)

if throughput is not None:

logs['throughput(TFLOP/s/GPU)'] = round(throughput, 3)

logs['MFU'] = round(throughput / self.device_peak_tflops, 6)

gemini-code-assist · 2026-06-01T11:00:16Z

+    def _get_throughput_tflops_per_gpu(self, train_speed):
+        if train_speed <= 0:
+            return None
+        num_flops = self._num_floating_point_operations(self.args.global_batch_size)
+        if num_flops is None:
+            return None
+        return num_flops / (train_speed * 10**12 * self.args.world_size)


If self.args.world_size is None or 0, this will raise a TypeError or ZeroDivisionError. Safely retrieving world_size with a fallback to 1 prevents potential runtime crashes.

Suggested change

def _get_throughput_tflops_per_gpu(self, train_speed):

if train_speed <= 0:

return None

num_flops = self._num_floating_point_operations(self.args.global_batch_size)

if num_flops is None:

return None

return num_flops / (train_speed * 10**12 * self.args.world_size)

def _get_throughput_tflops_per_gpu(self, train_speed):

if train_speed <= 0:

return None

world_size = getattr(self.args, 'world_size', None) or 1

num_flops = self._num_floating_point_operations(self.args.global_batch_size)

if num_flops is None:

return None

return num_flops / (train_speed * 10**12 * world_size)

gemini-code-assist · 2026-06-01T11:00:16Z

+    def _num_floating_point_operations(self, batch_size):
+        config = self.trainer.config
+        hidden_size = getattr(config, 'hidden_size', None)
+        num_layers = getattr(config, 'num_layers', None)
+        num_attention_heads = getattr(config, 'num_attention_heads', None)
+        ffn_hidden_size = getattr(config, 'ffn_hidden_size', None)
+        if None in {hidden_size, num_layers, num_attention_heads, ffn_hidden_size}:
+            return None
+
+        kv_channels = getattr(config, 'kv_channels', None) or hidden_size // num_attention_heads
+        num_query_groups = getattr(config, 'num_query_groups', None) or num_attention_heads
+        padded_vocab_size = getattr(config, 'padded_vocab_size', None) or getattr(config, 'vocab_size', None)
+        if padded_vocab_size is None:
+            return None
+
+        query_projection_size = kv_channels * num_attention_heads
+        query_projection_to_hidden_size_ratio = query_projection_size / hidden_size
+        num_experts_routed_to = 1 if getattr(config, 'num_moe_experts', None) is None else getattr(
+            config, 'moe_router_topk', 1)
+        gated_linear_multiplier = 1.5 if getattr(config, 'swiglu', False) else 1.0
+
+        return (
+            12 * batch_size * self.args.seq_length * num_layers * hidden_size * hidden_size * (
+                (1 + (num_query_groups / num_attention_heads) + (self.args.seq_length / hidden_size))
+                * query_projection_to_hidden_size_ratio
+                + (ffn_hidden_size / hidden_size) * num_experts_routed_to * gated_linear_multiplier
+                + padded_vocab_size / (2 * num_layers * hidden_size)))


If self.args.seq_length is None, this calculation will raise a TypeError. Retrieving seq_length defensively and returning None early if it is not set prevents this issue.

def _num_floating_point_operations(self, batch_size): seq_length = getattr(self.args, 'seq_length', None) if seq_length is None: return None config = self.trainer.config hidden_size = getattr(config, 'hidden_size', None) num_layers = getattr(config, 'num_layers', None) num_attention_heads = getattr(config, 'num_attention_heads', None) ffn_hidden_size = getattr(config, 'ffn_hidden_size', None) if None in {hidden_size, num_layers, num_attention_heads, ffn_hidden_size}: return None kv_channels = getattr(config, 'kv_channels', None) or hidden_size // num_attention_heads num_query_groups = getattr(config, 'num_query_groups', None) or num_attention_heads padded_vocab_size = getattr(config, 'padded_vocab_size', None) or getattr(config, 'vocab_size', None) if padded_vocab_size is None: return None query_projection_size = kv_channels * num_attention_heads query_projection_to_hidden_size_ratio = query_projection_size / hidden_size num_experts_routed_to = 1 if getattr(config, 'num_moe_experts', None) is None else getattr( config, 'moe_router_topk', 1) gated_linear_multiplier = 1.5 if getattr(config, 'swiglu', False) else 1.0 return ( 12 * batch_size * seq_length * num_layers * hidden_size * hidden_size * ( (1 + (num_query_groups / num_attention_heads) + (seq_length / hidden_size)) * query_projection_to_hidden_size_ratio + (ffn_hidden_size / hidden_size) * num_experts_routed_to * gated_linear_multiplier + padded_vocab_size / (2 * num_layers * hidden_size)))

gemini-code-assist · 2026-06-01T11:00:16Z



-def add_train_message(logs, state, start_time, start_step) -> None:
+def add_train_message(logs, state, start_time, start_step, args=None) -> None:


If logs is None, attempting to set keys on it will raise a TypeError. Adding an early return when logs is None prevents potential runtime crashes.

Suggested change

def add_train_message(logs, state, start_time, start_step, args=None) -> None:

def add_train_message(logs, state, start_time, start_step, args=None) -> None:

if logs is None:

return

gemini-code-assist · 2026-06-01T11:00:16Z

+    def on_log(self, args: TrainingArguments, state: TrainerState, control, logs=None, **kwargs):
+        total_flos = getattr(state, 'total_flos', 0) - self.start_flos
+        if self.elapsed > 0 and self.device_tflops and total_flos > 0:
+            logs['MFU'] = round(total_flos / self.elapsed / (self.device_tflops * 1e12), 6)
+        return super().on_log(args, state, control, logs, **kwargs)


If logs is None, attempting to assign logs['MFU'] will raise a TypeError. Checking that logs is not None before assigning makes the callback robust.

Suggested change

def on_log(self, args: TrainingArguments, state: TrainerState, control, logs=None, **kwargs):

total_flos = getattr(state, 'total_flos', 0) - self.start_flos

if self.elapsed > 0 and self.device_tflops and total_flos > 0:

logs['MFU'] = round(total_flos / self.elapsed / (self.device_tflops * 1e12), 6)

return super().on_log(args, state, control, logs, **kwargs)

def on_log(self, args: TrainingArguments, state: TrainerState, control, logs=None, **kwargs):

total_flos = getattr(state, 'total_flos', 0) - self.start_flos

if self.elapsed > 0 and self.device_tflops and total_flos > 0 and logs is not None:

logs['MFU'] = round(total_flos / self.elapsed / (self.device_tflops * 1e12), 6)

return super().on_log(args, state, control, logs, **kwargs)

add TGS and MFU logging for HF and Megatron training

9616411

gemini-code-assist Bot reviewed Jun 1, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add lightweight TGS and MFU logging for training#9465

Add lightweight TGS and MFU logging for training#9465
WendaDeng wants to merge 1 commit into
modelscope:mainfrom
WendaDeng:feat/tgs-mfu-log

WendaDeng commented Jun 1, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 1, 2026

Uh oh!

gemini-code-assist Bot Jun 1, 2026

Uh oh!

gemini-code-assist Bot Jun 1, 2026

Uh oh!

gemini-code-assist Bot Jun 1, 2026

Uh oh!

gemini-code-assist Bot Jun 1, 2026

Uh oh!

gemini-code-assist Bot Jun 1, 2026

Uh oh!

gemini-code-assist Bot Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	world_size = max(getattr(args, 'world_size', 1), 1)
	world_size = max(getattr(args, 'world_size', None) or 1, 1)

	self.device_tflops = tflops * max(getattr(args, 'world_size', 1), 1)
	self.device_tflops = tflops * max(getattr(args, 'world_size', None) or 1, 1)



		def add_train_message(logs, state, start_time, start_step) -> None:
		def add_train_message(logs, state, start_time, start_step, args=None) -> None:

Conversation

WendaDeng commented Jun 1, 2026

PR type

PR information

PR explanation:

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant