PySDK Version
Describe the bug
When launching a SageMaker training job from Windows using sagemaker.train.ModelTrainer (SDK v3), the generated bootstrap script sm_train.sh is written with CRLF line endings. Inside the Linux training container, bash fails to parse it and the job exits before user training code starts.
The script appears to be written in sagemaker/train/model_trainer.py with:
with open(os.path.join(tmp_dir.name, TRAIN_SCRIPT), "w") as f:
which applies platform newline conversion on Windows (\r\n).
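The newline translation described above can be reproduced on any host OS. This is a minimal sketch, not the SDK's code: it passes `newline="\r\n"` explicitly to simulate what text mode does by default on Windows, and the script contents are a made-up stand-in for `sm_train.sh`.

```python
import os
import tempfile

# Stand-in script body; the real sm_train.sh contents are not reproduced here.
SCRIPT = "#!/bin/bash\nset -e\necho 'Starting training script'\n"


def write_and_read_bytes(path, newline):
    """Write SCRIPT in text mode with the given newline setting, return raw bytes."""
    # With newline=None (the default), "\n" is translated to os.linesep on write.
    # Passing an explicit value overrides that platform-dependent default.
    with open(path, "w", newline=newline) as f:
        f.write(SCRIPT)
    with open(path, "rb") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sm_train.sh")
    windows_like = write_and_read_bytes(path, "\r\n")  # simulates the Windows default
    forced_lf = write_and_read_bytes(path, "\n")       # no translation on any OS

print("CR bytes with Windows-style translation:", windows_like.count(b"\r"))
print("CR bytes with newline='\\n':", forced_lf.count(b"\r"))
```

Each stray carriage return ends up at the end of a line bash reads, which matches the `$'\r': command not found` errors in the logs.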
To reproduce
- Use Windows host with SageMaker Python SDK v3.
- Run this complete script (replace ROLE_ARN and S3_INPUT):
from sagemaker.core import image_uris
from sagemaker.core.helper.session_helper import Session
from sagemaker.core.training.configs import SourceCode, Compute, InputData
from sagemaker.train import ModelTrainer

session = Session()
region = session.boto_region_name

training_image = image_uris.retrieve(
    framework="pytorch",
    region=region,
    version="2.4.0",
    py_version="py311",
    instance_type="ml.g4dn.xlarge",
    image_scope="training",
)

trainer = ModelTrainer(
    sagemaker_session=session,
    role="ROLE_ARN",
    training_image=training_image,
    source_code=SourceCode(
        source_dir=".",
        entry_script="sagemaker_entry.py",
        requirements="requirements/sagemaker_train.txt",
    ),
    compute=Compute(instance_type="ml.g4dn.xlarge", instance_count=1),
)

trainer.train(
    input_data_config=[InputData(channel_name="train", data_source="S3_INPUT")],
    wait=True,
    logs=True,
)
- Check CloudWatch logs for the training job.
Expected behavior
sm_train.sh should be written with LF (\n) and execute correctly in the Linux container, allowing the training entry point to start.
Screenshots or logs
Observed logs:
/opt/ml/input/data/sm_drivers/sm_train.sh: line 1: $'\r': command not found
Starting training script#015
/opt/ml/input/data/sm_drivers/sm_train.sh: line 3: set: -#015: invalid option
set: usage: set [-abefhkmnptuvxBCHP] [-o option-name] [--]
/opt/ml/input/data/sm_drivers/sm_train.sh: line 5: $'\r': command not found
/opt/ml/input/data/sm_drivers/sm_train.sh: line 6: syntax error near unexpected token `$'{\r''
/opt/ml/input/data/sm_drivers/sm_train.sh: line 6: `handle_error() {#015'
System information
- SageMaker Python SDK version: 3.12.0
- Framework name (eg. PyTorch) or algorithm (eg. KMeans): PyTorch
- Framework version: 2.4.0 (DLC)
- Python version: 3.11
- CPU or GPU: GPU (ml.g4dn.xlarge)
- Custom Docker image (Y/N): N
Additional context
- Reproduced from Windows 10 host.
- Workaround: launching from Linux (WSL/Studio/EC2) avoids CRLF in sm_train.sh.
- Suggested fix in SDK: force LF on script write, e.g. open(..., "w", newline="\n").
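The suggested fix could look roughly like the sketch below. The helper names `write_shell_script` and `has_crlf` are hypothetical and not part of the SDK; the only assumption carried over from the report is that the script is written with a plain text-mode `open(..., "w")`.

```python
import os
import tempfile


def write_shell_script(path: str, content: str) -> None:
    """Hypothetical helper: write a shell script with LF endings on any host OS."""
    # Normalize any CRLF already present in the content, then disable
    # platform newline translation so "\n" is written as a single LF byte.
    normalized = content.replace("\r\n", "\n")
    with open(path, "w", newline="\n") as f:
        f.write(normalized)


def has_crlf(path: str) -> bool:
    """Hypothetical check: return True if the file contains any CRLF sequence."""
    with open(path, "rb") as f:
        return b"\r\n" in f.read()


# Usage: write a stand-in script and confirm it contains no CRLF.
with tempfile.TemporaryDirectory() as tmp_dir:
    script_path = os.path.join(tmp_dir, "sm_train.sh")
    write_shell_script(script_path, "#!/bin/bash\r\nset -e\necho 'start'\n")
    print("CRLF present:", has_crlf(script_path))
```

Opening the file in binary mode and writing encoded bytes would avoid translation as well; `newline="\n"` is the smaller change to the existing `open` call.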