ModelTrainer generates sm_train.sh with CRLF line endings on Windows causing training job failure #5904

@prashanthreddy31

PySDK Version

  • PySDK V3 (3.x)

Describe the bug
When using ModelTrainer with SourceCode on Windows, the SDK internally generates sm_train.sh with CRLF (\r\n) line endings. This causes the training job to fail immediately when the Linux container tries to execute it.

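The root cause is in the _prepare_train_script method in model_trainer.py, which opens the file in text mode without a newline argument. On Windows, Python then writes every \n as \r\n:
with open(os.path.join(tmp_dir.name, TRAIN_SCRIPT), "w") as f:
    f.write(train_script)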
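The translation can be reproduced on any OS with a minimal sketch. Passing newline="\r\n" mimics what the default (newline=None) does on Windows, where os.linesep is "\r\n". The script text and the helper name write_and_read_bytes are illustrative assumptions, not the SDK's actual template or code.

```python
import os
import tempfile
from typing import Optional

# Stand-in for the generated script (hypothetical content, not the SDK template).
SCRIPT_TEXT = "#!/bin/bash\nset -e\necho training\n"


def write_and_read_bytes(text: str, newline: Optional[str]) -> bytes:
    """Write text in text mode with the given newline setting and return the raw bytes."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "sm_train.sh")
        with open(path, "w", newline=newline) as f:
            f.write(text)
        with open(path, "rb") as f:
            return f.read()


# Default text mode: "\n" is translated to os.linesep of the host OS.
default_bytes = write_and_read_bytes(SCRIPT_TEXT, newline=None)

# Simulated Windows behavior: "\n" is translated to "\r\n" on any host.
windows_like = write_and_read_bytes(SCRIPT_TEXT, newline="\r\n")

print(repr(default_bytes))
print(repr(windows_like))
```

On Linux or macOS the first result keeps plain LF, while the second shows the carriage returns that the container's bash later rejects.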

To reproduce
from sagemaker.train import ModelTrainer
from sagemaker.train.configs import SourceCode, Compute, InputData, OutputDataConfig

source_code = SourceCode(
    source_dir="src",
    entry_script="train.py",
    requirements="requirements.txt",
)

compute = Compute(
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

model_trainer = ModelTrainer(
    training_image="",
    role="",
    source_code=source_code,
    compute=compute,
)

train_data = InputData(channel_name="train", data_source="s3://bucket/train/")
val_data = InputData(channel_name="validation", data_source="s3://bucket/val/")

model_trainer.train(input_data_config=[train_data, val_data], wait=True)

Expected behavior
sm_train.sh should always be written with LF (\n) line endings regardless of the host OS, since it will always be executed inside a Linux container.
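One way to check this expectation, for example in a regression test, is to scan the written bytes for lines that end in a carriage return. The helper name lines_with_carriage_return and the sample byte strings below are hypothetical and only illustrate the idea.

```python
def lines_with_carriage_return(data: bytes) -> list[int]:
    """Return the 1-based numbers of lines that end in a carriage return."""
    return [
        number
        for number, line in enumerate(data.split(b"\n"), start=1)
        if line.endswith(b"\r")
    ]


# Made-up samples: one as written on Windows today, one as expected.
crlf_script = b"#!/bin/bash\r\nset -e\r\n"
lf_script = b"#!/bin/bash\nset -e\n"

print(lines_with_carriage_return(crlf_script))
print(lines_with_carriage_return(lf_script))
```

An empty list means the file is safe to execute in the Linux container.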

Error in CloudWatch Logs
/opt/ml/input/data/sm_drivers/sm_train.sh: line 1: $'\r': command not found
/opt/ml/input/data/sm_drivers/sm_train.sh: line 3: set: -#15: invalid option
set: usage: set [-abefhkmnptuvxBCHP] [-o option-name] [--] [arg ...]
/opt/ml/input/data/sm_drivers/sm_train.sh: line 6: syntax error near unexpected token `$'{\r''
/opt/ml/input/data/sm_drivers/sm_train.sh: line 6: `handle_error() {#15'

Proposed Fix

Current code (line in _prepare_train_script):

with open(os.path.join(tmp_dir.name, TRAIN_SCRIPT), "w") as f:

Fix — force LF line endings:

with open(os.path.join(tmp_dir.name, TRAIN_SCRIPT), "w", newline="\n") as f:
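A minimal sketch of the proposed call in isolation, assuming a hypothetical helper write_train_script that stands in for the relevant part of _prepare_train_script. With newline="\n", Python performs no newline translation on write, so the bytes on disk match the string on every OS.

```python
import os
import tempfile

TRAIN_SCRIPT = "sm_train.sh"  # file name as used in the issue


def write_train_script(directory: str, train_script: str) -> str:
    """Write the script with LF line endings regardless of the host OS."""
    path = os.path.join(directory, TRAIN_SCRIPT)
    # newline="\n" disables the "\n" -> os.linesep translation of text mode.
    with open(path, "w", newline="\n") as f:
        f.write(train_script)
    return path


with tempfile.TemporaryDirectory() as tmp_dir:
    script_path = write_train_script(tmp_dir, "#!/bin/bash\nset -e\n")
    with open(script_path, "rb") as f:
        written = f.read()

print(repr(written))
```

Note that this only prevents translation. If the script string itself already contained "\r\n", those characters would be written unchanged.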

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: 3.x
  • Framework name or algorithm: XGBoost
  • Framework version: 1.7-1
  • Python version: 3.11.5
  • CPU or GPU: CPU (ml.m5.xlarge)
  • Custom Docker image (Y/N): N

Additional context
This issue affects all Windows users of the new ModelTrainer API (PySDK V3). The sm_train.sh file is generated entirely by the SDK on the client machine and is never touched by the user, so the only ways around the failure are patching the SDK or switching to the older Estimator API. The fix is a one-argument change: adding newline="\n" to the open() call.
