PySDK Version
Describe the bug
When using ModelTrainer with SourceCode on Windows, the SDK internally generates sm_train.sh with CRLF (\r\n) line endings. This causes the training job to fail immediately when the Linux container tries to execute it.
The root cause is in model_trainer.py in the _prepare_train_script method:
with open(os.path.join(tmp_dir.name, TRAIN_SCRIPT), "w") as f:
    f.write(train_script)
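The translation can be reproduced on any OS. In text mode with the default `newline=None`, Python converts every `"\n"` written to `os.linesep`, which is `"\r\n"` on Windows. The sketch below passes `newline="\r\n"` explicitly to simulate that Windows default; the script content and file name are placeholders, not the SDK's real template.

```python
import os
import tempfile

# Placeholder script body; the real sm_train.sh template is longer.
script = "#!/bin/bash\nset -e\necho training\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sm_train.sh")

    # newline="\r\n" mimics on any OS what the default (newline=None)
    # does on Windows, where os.linesep is "\r\n".
    with open(path, "w", newline="\r\n") as f:
        f.write(script)

    # Read the raw bytes back to see what actually landed on disk.
    with open(path, "rb") as f:
        raw = f.read()

print(raw)  # b'#!/bin/bash\r\nset -e\r\necho training\r\n'
```

Every line now ends in `\r\n`, even though the string passed to `write()` contained only `\n`.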
To reproduce
from sagemaker.train import ModelTrainer
from sagemaker.train.configs import SourceCode, Compute, InputData, OutputDataConfig

source_code = SourceCode(
    source_dir="src",
    entry_script="train.py",
    requirements="requirements.txt"
)

compute = Compute(
    instance_type="ml.m5.xlarge",
    instance_count=1
)

model_trainer = ModelTrainer(
    training_image="",
    role="",
    source_code=source_code,
    compute=compute,
)

train_data = InputData(channel_name="train", data_source="s3://bucket/train/")
val_data = InputData(channel_name="validation", data_source="s3://bucket/val/")

model_trainer.train(input_data_config=[train_data, val_data], wait=True)
Expected behavior
sm_train.sh should always be written with LF (\n) line endings regardless of the host OS, since it will always be executed inside a Linux container.
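One way to check this expectation, for example in a regression test, is to look for carriage-return bytes in the generated file. This is a minimal sketch: `has_carriage_returns` is a hypothetical helper, not part of the SDK, and the two files are written here only to exercise it.

```python
import tempfile
from pathlib import Path


def has_carriage_returns(path):
    """Return True if the file contains any CR byte (hypothetical helper)."""
    return b"\r" in Path(path).read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    crlf_file = Path(tmp) / "crlf.sh"
    lf_file = Path(tmp) / "lf.sh"

    # Write bytes directly so the result is identical on every OS.
    crlf_file.write_bytes(b"#!/bin/bash\r\nset -e\r\n")
    lf_file.write_bytes(b"#!/bin/bash\nset -e\n")

    crlf_result = has_carriage_returns(crlf_file)
    lf_result = has_carriage_returns(lf_file)

print(crlf_result, lf_result)  # True False
```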
Error in CloudWatch Logs
/opt/ml/input/data/sm_drivers/sm_train.sh: line 1: $'\r': command not found
/opt/ml/input/data/sm_drivers/sm_train.sh: line 3: set: -#15: invalid option
set: usage: set [-abefhkmnptuvxBCHP] [-o option-name] [--] [arg ...]
/opt/ml/input/data/sm_drivers/sm_train.sh: line 6: syntax error near unexpected token `$'{\r''
/opt/ml/input/data/sm_drivers/sm_train.sh: line 6: `handle_error() {#15'
Proposed Fix
Current code (line in _prepare_train_script):
with open(os.path.join(tmp_dir.name, TRAIN_SCRIPT), "w") as f:
Fix — force LF line endings:
with open(os.path.join(tmp_dir.name, TRAIN_SCRIPT), "w", newline="\n") as f:
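The effect of the proposed change can be sketched outside the SDK. With `newline="\n"`, Python performs no newline translation on write, so the file contains LF on every platform. `write_script_lf` is a hypothetical helper used for illustration only; the extra `replace` call is an optional safeguard in case the script string already contains CRLF, and is not part of the proposed one-line fix.

```python
import os
import tempfile


def write_script_lf(path, content):
    """Write a shell script with LF line endings on any host OS (hypothetical helper)."""
    # Optional safeguard: collapse CRLF already present in the string.
    normalized = content.replace("\r\n", "\n")
    # newline="\n" disables translation of "\n" to os.linesep.
    with open(path, "w", newline="\n") as f:
        f.write(normalized)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sm_train.sh")
    write_script_lf(path, "#!/bin/bash\r\nset -e\necho training\n")

    with open(path, "rb") as f:
        written = f.read()

print(written)  # b'#!/bin/bash\nset -e\necho training\n'
```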
System information
- SageMaker Python SDK version: 3.x
- Framework name or algorithm: XGBoost
- Framework version: 1.7-1
- Python version: 3.11.5
- CPU or GPU: CPU (ml.m5.xlarge)
- Custom Docker image (Y/N): N
Additional context
This issue affects all Windows users of the new ModelTrainer API (PySDK V3). The sm_train.sh file is generated entirely by the SDK on the client machine and is never touched by the user, so users cannot fix it without either patching the SDK or switching to the older Estimator API. The fix is a one-argument change: adding newline="\n" to the open() call.