Code Review
This pull request introduces a GRPO training script for the GSM8K dataset using Ray, incorporating a brevity reward mechanism to promote concise reasoning. It also adds a patch to the Megatron PEFT implementation to disable merge checks. Key feedback includes correcting the log probability extraction logic for vLLM outputs, moving initialization calls into the main function to prevent side effects, and optimizing performance by instantiating reward functions outside the training loop. Additionally, the commented-out LoRA adapter configuration should be addressed to ensure the intended training behavior.
```python
for sample_response in sample_responses:
    for sequence in sample_response.sequences:
        all_input_data.append(sequence.new_input_feature)
        all_old_logps.append([logprob[0][1] for logprob in sequence.logprobs])
```
The extraction logic [logprob[0][1] for logprob in sequence.logprobs] appears to be incorrect for the standard vLLM output format. In vLLM, sequence.logprobs is typically a list of dictionaries mapping token IDs to Logprob objects. Accessing logprob[0] would perform a key lookup for token ID 0, and [1] would fail on the resulting Logprob object. This should be updated to correctly extract the log probability of the sampled token.
Suggested change:
```diff
- all_old_logps.append([logprob[0][1] for logprob in sequence.logprobs])
+ all_old_logps.append([list(lp.values())[0].logprob for lp in sequence.logprobs if lp])
```
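A minimal sketch of the structure this comment describes, using a mock `Logprob` object in place of vLLM's (the class and sample data here are illustrative, not vLLM itself). It assumes vLLM's layout where `sequence.logprobs` is a list with one `dict[token_id, Logprob]` per generated token; looking up the sampled token's ID explicitly is more robust than relying on dict ordering:

```python
from dataclasses import dataclass

@dataclass
class Logprob:          # stand-in for vllm.Logprob; only .logprob is used here
    logprob: float

class Sequence:         # stand-in for a vLLM sequence output
    def __init__(self, token_ids, logprobs):
        self.token_ids = token_ids   # the sampled token IDs
        self.logprobs = logprobs     # list of dicts keyed by token ID

# One dict per generated token, keyed by token ID (illustrative values).
seq = Sequence(
    token_ids=[42, 7],
    logprobs=[{42: Logprob(-0.1)}, {7: Logprob(-1.3)}],
)

# Extract the log probability of each *sampled* token by its ID.
old_logps = [lps[tok].logprob for tok, lps in zip(seq.token_ids, seq.logprobs) if lps]
print(old_logps)  # → [-0.1, -1.3]
```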
```python
swanlab.init(
    project='twinkle',
)
```
Calling swanlab.init at module import time causes side effects whenever the script is imported. This initialization call should be moved into the main function.
```python
accuracy_reward_fn = GSM8KAccuracyReward()
brevity_reward_fn = GSM8KBrevityReward()
```
Instantiating the reward functions GSM8KAccuracyReward and GSM8KBrevityReward inside compute_rewards is inefficient because this function is called in every training iteration. These should be instantiated once outside the training loop and passed in or reused to avoid unnecessary overhead.
Suggested change:
```diff
- accuracy_reward_fn = GSM8KAccuracyReward()
- brevity_reward_fn = GSM8KBrevityReward()
+ # These should ideally be instantiated once outside the loop
+ accuracy_reward_fn = kwargs.get('accuracy_reward_fn', GSM8KAccuracyReward())
+ brevity_reward_fn = kwargs.get('brevity_reward_fn', GSM8KBrevityReward())
```
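A sketch of the hoisting pattern this comment recommends, with hypothetical stand-ins for `GSM8KAccuracyReward` and `GSM8KBrevityReward` (their real constructors and signatures are not shown in this diff, so the classes below are illustrative only):

```python
class AccuracyReward:        # stand-in for GSM8KAccuracyReward
    def __call__(self, completion, answer):
        return 1.0 if answer in completion else 0.0

class BrevityReward:         # stand-in for GSM8KBrevityReward
    def __init__(self, target_len=200):
        self.target_len = target_len
    def __call__(self, completion, answer):
        # shorter completions score higher, capped at 1.0
        return min(1.0, self.target_len / max(len(completion), 1))

# Instantiate once, outside the training loop ...
accuracy_fn, brevity_fn = AccuracyReward(), BrevityReward()

def compute_rewards(completions, answers, accuracy_fn=accuracy_fn, brevity_fn=brevity_fn):
    # ... and reuse the same instances on every iteration.
    return [accuracy_fn(c, a) + brevity_fn(c, a) for c, a in zip(completions, answers)]

print(compute_rewards(["the answer is 4"], ["4"]))  # → [2.0]
```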
cookbook/rl/short_math_grpo.py (Outdated)
```python
    remote_group='model',
)

# model.add_adapter_to_model(ADAPTER_NAME, lora_config, gradient_accumulation_steps=1)
```
The call to add_adapter_to_model is currently commented out. Since the script defines LORA_RANK and lora_config, it seems intended for LoRA training. If this remains commented out, the model will perform full parameter fine-tuning, which might lead to Out-Of-Memory (OOM) issues or inconsistent behavior given the sampler configuration.
Suggested change:
```diff
- # model.add_adapter_to_model(ADAPTER_NAME, lora_config, gradient_accumulation_steps=1)
+ model.add_adapter_to_model(ADAPTER_NAME, lora_config, gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS)
```
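A sketch of the failure mode described above, using a hypothetical `Model` class (the real model object and `add_adapter_to_model` signature come from the PR's Megatron PEFT code, not shown here): defining `lora_config` has no effect on its own, so a cheap assertion after setup catches the case where the adapter call was left commented out and the run would silently full fine-tune.

```python
class Model:  # hypothetical stand-in; tracks attached adapters
    def __init__(self):
        self.adapters = {}

    def add_adapter_to_model(self, name, config, gradient_accumulation_steps=1):
        self.adapters[name] = config

ADAPTER_NAME = 'lora'
LORA_RANK = 8
lora_config = {'r': LORA_RANK, 'target_modules': ['q_proj', 'v_proj']}  # illustrative

model = Model()
model.add_adapter_to_model(ADAPTER_NAME, lora_config, gradient_accumulation_steps=1)

# Fail fast if the adapter was never attached; otherwise every parameter
# trains, which is where the OOM risk mentioned in the review comes from.
assert ADAPTER_NAME in model.adapters, 'LoRA adapter not attached; would full fine-tune'
```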
PR type
PR information
Write the detailed information that belongs to this PR.
Experiment results
Paste your experiment results here (if needed).