
Fix bugs#139

Open
tastelikefeet wants to merge 2 commits into modelscope:main from tastelikefeet:fix/0404-1

Conversation

@tastelikefeet
Collaborator

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

Write the detailed information belonging to this PR.

Experiment results

Paste your experiment results here (if needed).

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a GRPO training script for the GSM8K dataset using Ray, incorporating a brevity reward mechanism to promote concise reasoning. It also adds a patch to the Megatron PEFT implementation to disable merge checks. Key feedback includes correcting the log probability extraction logic for vLLM outputs, moving initialization calls into the main function to prevent side effects, and optimizing performance by instantiating reward functions outside the training loop. Additionally, the commented-out LoRA adapter configuration should be addressed to ensure the intended training behavior.

for sample_response in sample_responses:
    for sequence in sample_response.sequences:
        all_input_data.append(sequence.new_input_feature)
        all_old_logps.append([logprob[0][1] for logprob in sequence.logprobs])
Contributor


high

The extraction logic [logprob[0][1] for logprob in sequence.logprobs] appears to be incorrect for the standard vLLM output format. In vLLM, sequence.logprobs is typically a list of dictionaries mapping token IDs to Logprob objects. Accessing logprob[0] would perform a key lookup for token ID 0, and [1] would fail on the resulting Logprob object. This should be updated to correctly extract the log probability of the sampled token.

Suggested change
all_old_logps.append([logprob[0][1] for logprob in sequence.logprobs])
all_old_logps.append([list(lp.values())[0].logprob for lp in sequence.logprobs if lp])
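To illustrate why the original indexing fails, here is a minimal, self-contained sketch that simulates vLLM's per-token logprobs structure (a list of dicts keyed by token ID). The `Logprob` class below is a stand-in with only the field used here, not the actual vLLM class, and the token IDs are made up:

```python
from dataclasses import dataclass

# Stand-in for vLLM's Logprob object; only the field we need here.
@dataclass
class Logprob:
    logprob: float

# vLLM returns one dict per generated token, mapping token IDs to
# Logprob objects. With logprobs=1, each dict holds the sampled token.
sequence_logprobs = [
    {1015: Logprob(-0.12)},
    {2031: Logprob(-1.74)},
    {13: Logprob(-0.03)},
]

# Buggy version: logprob[0] is a key lookup for token ID 0 (usually
# absent, so KeyError), and [1] would fail on a Logprob object anyway.
# Fixed version: take the first (sampled) entry's logprob per token.
old_logps = [list(lp.values())[0].logprob for lp in sequence_logprobs if lp]
print(old_logps)  # [-0.12, -1.74, -0.03]
```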

Comment on lines +54 to +56
swanlab.init(
project='twinkle',
)
Contributor


medium

The swanlab.init call is placed at the module level. This can lead to unexpected side effects when the script is imported or used in different contexts. It is generally better to move initialization logic inside the main() function or a dedicated setup function.
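A minimal sketch of the suggested refactor: the tracker is initialized only when the script is executed, not on import. A no-op stand-in replaces the real `swanlab` module here so the sketch is self-contained; `'twinkle'` is the project name from the PR:

```python
# Stand-in for the real swanlab module, so this sketch runs anywhere.
class _FakeSwanlab:
    def init(self, project):
        self.project = project
        return self

swanlab = _FakeSwanlab()  # the actual script would `import swanlab`

def main():
    # Moved from module level: no side effects when this file is imported.
    run = swanlab.init(project='twinkle')
    # ... training setup and loop ...
    return run

if __name__ == '__main__':
    main()
```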

Comment on lines +105 to +106
accuracy_reward_fn = GSM8KAccuracyReward()
brevity_reward_fn = GSM8KBrevityReward()
Contributor


medium

Instantiating the reward functions GSM8KAccuracyReward and GSM8KBrevityReward inside compute_rewards is inefficient because this function is called in every training iteration. These should be instantiated once outside the training loop and passed in or reused to avoid unnecessary overhead.

Suggested change
accuracy_reward_fn = GSM8KAccuracyReward()
brevity_reward_fn = GSM8KBrevityReward()
# These should ideally be instantiated once outside the loop
accuracy_reward_fn = kwargs.get('accuracy_reward_fn', GSM8KAccuracyReward())
brevity_reward_fn = kwargs.get('brevity_reward_fn', GSM8KBrevityReward())
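A sketch of the hoisting pattern being suggested. The two reward classes below are simplified stand-ins (the real `GSM8KAccuracyReward` and `GSM8KBrevityReward` live in the PR); the point is only that construction happens once, outside the per-iteration path:

```python
# Simplified stand-ins for the PR's reward classes.
class GSM8KAccuracyReward:
    def __call__(self, completion, answer):
        return 1.0 if answer in completion else 0.0

class GSM8KBrevityReward:
    def __init__(self, target_len=256):
        self.target_len = target_len
    def __call__(self, completion, answer):
        # Shorter completions score higher; clamped at zero.
        return max(0.0, 1.0 - len(completion) / (4 * self.target_len))

# Instantiated once, outside the training loop, then reused every step.
ACCURACY_REWARD = GSM8KAccuracyReward()
BREVITY_REWARD = GSM8KBrevityReward()

def compute_rewards(completions, answers):
    return [
        ACCURACY_REWARD(c, a) + BREVITY_REWARD(c, a)
        for c, a in zip(completions, answers)
    ]

print(compute_rewards(['the answer is 42'], ['42']))
```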

remote_group='model',
)

# model.add_adapter_to_model(ADAPTER_NAME, lora_config, gradient_accumulation_steps=1)
Contributor


medium

The call to add_adapter_to_model is currently commented out. Since the script defines LORA_RANK and lora_config, it seems intended for LoRA training. If this remains commented out, the model will perform full parameter fine-tuning, which might lead to Out-Of-Memory (OOM) issues or inconsistent behavior given the sampler configuration.

Suggested change
# model.add_adapter_to_model(ADAPTER_NAME, lora_config, gradient_accumulation_steps=1)
model.add_adapter_to_model(ADAPTER_NAME, lora_config, gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS)
