Windows host writes sm_train.sh with CRLF in SDK v3, causing SageMaker training job bootstrap failure ($'\r': command not found) #5897

@ARVSC-occ

PySDK Version

  • PySDK V2 (2.x)
  • PySDK V3 (3.x)

Describe the bug
When launching a SageMaker training job from Windows using sagemaker.train.ModelTrainer (SDK v3), the generated bootstrap script sm_train.sh is written with CRLF line endings. Inside the Linux training container, bash fails to parse it and the job exits before user training code starts.

The script appears to be written in sagemaker/train/model_trainer.py with:

with open(os.path.join(tmp_dir.name, TRAIN_SCRIPT), "w") as f:

Because this opens the file in text mode with the default newline=None, Python translates every "\n" written to os.linesep, which is "\r\n" on Windows.
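The translation can be reproduced on any platform by passing newline="\r\n" explicitly, which forces the result that the default produces on Windows. This is a minimal sketch: the script body is a made-up stand-in, not the SDK's actual sm_train.sh template.

```python
import os
import tempfile

# Hypothetical stand-in for the generated bootstrap script.
SCRIPT = "#!/bin/bash\nset -e\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sm_train.sh")
    # newline="\r\n" forces the translation that the default (newline=None)
    # performs on Windows, where os.linesep is "\r\n".
    with open(path, "w", newline="\r\n") as f:
        f.write(SCRIPT)
    # Read back in binary mode to see the bytes the container would receive.
    with open(path, "rb") as f:
        data = f.read()

print(data)  # b'#!/bin/bash\r\nset -e\r\n'
```

Every line now ends in a carriage return that bash in the Linux container treats as part of the command.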

To reproduce

  1. Use Windows host with SageMaker Python SDK v3.
  2. Run this complete script (replace ROLE_ARN and S3_INPUT):
from sagemaker.core import image_uris
from sagemaker.core.helper.session_helper import Session
from sagemaker.core.training.configs import SourceCode, Compute, InputData
from sagemaker.train import ModelTrainer

session = Session()
region = session.boto_region_name

training_image = image_uris.retrieve(
    framework="pytorch",
    region=region,
    version="2.4.0",
    py_version="py311",
    instance_type="ml.g4dn.xlarge",
    image_scope="training",
)

trainer = ModelTrainer(
    sagemaker_session=session,
    role="ROLE_ARN",
    training_image=training_image,
    source_code=SourceCode(
        source_dir=".",
        entry_script="sagemaker_entry.py",
        requirements="requirements/sagemaker_train.txt",
    ),
    compute=Compute(instance_type="ml.g4dn.xlarge", instance_count=1),
)

trainer.train(
    input_data_config=[InputData(channel_name="train", data_source="S3_INPUT")],
    wait=True,
    logs=True,
)
  3. Check CloudWatch logs for the training job.

Expected behavior
sm_train.sh should be written with LF (\n) and execute correctly in the Linux container, allowing the training entry point to start.

Screenshots or logs
Observed logs:

/opt/ml/input/data/sm_drivers/sm_train.sh: line 1: $'\r': command not found
Starting training script#015
/opt/ml/input/data/sm_drivers/sm_train.sh: line 3: set: -#015: invalid option
set: usage: set [-abefhkmnptuvxBCHP] [-o option-name] [--]
/opt/ml/input/data/sm_drivers/sm_train.sh: line 5: $'\r': command not found
/opt/ml/input/data/sm_drivers/sm_train.sh: line 6: syntax error near unexpected token `$'{\r''
/opt/ml/input/data/sm_drivers/sm_train.sh: line 6: `handle_error() {#015'
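
The #015 tokens in these logs are most likely the logging pipeline's octal escape for the carriage return byte (octal 015 is decimal 13), the same stray character that bash reports as $'\r'. That reading is an assumption about how the logs are rendered. A quick check of the arithmetic:

```python
# Octal 015 is the carriage return character.
cr = chr(0o15)
assert cr == "\r"

# A log line as it appears above, with the escape mapped back to the raw byte.
logged = "Starting training script#015"
raw = logged.replace("#015", cr)

print(repr(raw))  # 'Starting training script\r'
```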

System information

  • SageMaker Python SDK version: 3.12.0
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): PyTorch
  • Framework version: 2.4.0 (DLC)
  • Python version: 3.11
  • CPU or GPU: GPU (ml.g4dn.xlarge)
  • Custom Docker image (Y/N): N

Additional context

  • Reproduced from Windows 10 host.
  • Workaround: launching from Linux (WSL/Studio/EC2) avoids CRLF in sm_train.sh.
  • Suggested fix in SDK: force LF on script write, e.g. open(..., "w", newline="\n").
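
The suggested fix could look roughly like the helper below. write_lf_script is a hypothetical name, not an SDK function. It also normalizes any CR already present in the content, which goes a little beyond the one-argument change proposed above.

```python
import os
import tempfile


def write_lf_script(path: str, content: str) -> None:
    """Write content with LF line endings regardless of host platform."""
    # Collapse any CRLF or lone CR already in the string to LF.
    normalized = content.replace("\r\n", "\n").replace("\r", "\n")
    # newline="\n" disables the "\n" -> os.linesep translation on write.
    with open(path, "w", newline="\n") as f:
        f.write(normalized)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sm_train.sh")
    write_lf_script(path, "#!/bin/bash\r\nset -e\r\n")
    with open(path, "rb") as f:
        data = f.read()

print(data)  # b'#!/bin/bash\nset -e\n'
```

Passing newline="" would also suppress the translation, but newline="\n" states the intended line ending explicitly.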
