
Commit d4a0e90

Authored by Yadan-Wei, lxning, Jyothirmaikottu, and Yadan Wei
SageMaker XGBoost-3.0 with Numpy 2.1.0 (#459)
* upgrade numpy to 2.0 * update setuptool * fix scikit-learn 1.4.2 error * test scipy 1.13.0 * fix typo * test scipy 1.10.0 * try numpy 2.1.0 * try numpy 2.1.0 * set numba 0.61.0 * set pyyaml 5.4.1 * set pyyaml 6.0.1 * set cryptography 45.0.5 * set requests 2.32.3 * fix image name * set panda 2.2.0 * set panda 2.2.3 * set python-dateutil==2.8.2 * set pyarrow 17.0.0 * replace rabit with dask-based api * set protobuf 5.26 * install pyarrow in container * set tbb 2022.2.0 * try mlio-py with pyarrow 17.0.0 * try mlio-py with pyarrow 17.0.0 * try install mlio * try install mlio * hack mlio * hack mlio * try protobuf 3.20.1 * set dask 2024.10.0 * set dask 2024.9.0 * set psutil 5.8.0 * update train test minor version * set matplotlib==3.6.3 * set matplotlib==3.6.3 * Trigger Build * test dask migration * test xgb migration * test xgb rabit * test xgb rapit migration * test dask expr backend migration * test rabit.tracker_print migration * test rabit and libsvm migration * test rabit and dask * test dask migratinon * test rabit * test _aggregate_predictions * recover checkpointing.py distributed.py * rabit deprecate * set env var * test distributed.py * test distributed.py * replace rabit with dask * replace rabit with collective * replace rabit with collective * replace rabit with collective * fmt * fix sklearn api deprecations * backward compatible for unit test * fmt * set matplotlib * set matplotlib==3.6.3 * set matplotlib==3.9.2 * set matplotlib==3.9.2 * fix model name * fix distributed training save model * fix distributed training save model * fix distributed training save model * fix distributed training save model * fix distributed training save model * fix distributed training save model * fix distributed training save model * fix distributed training save model * fix distributed training save model * fix distributed training save model * debug * debug master host * debug master host * debug master host * debug master host * debug master host * debug master host * 
debug master host * debug master host * debug master host * debug master host * debug master host * debug master host * debug master host * debug master host * debug master host * debug master host * debug master host * debug master host * debug master host * debug master host * check xgboost 2.1.1 * check xgboost 2.1.1 * check xgboost 2.1.1 * check xgboost 2.1.1 * check xgboost 2.1.1 * check xgboost 2.1.1 * check xgboost 2.1.1 * check xgboost 2.1.1 * check xgboost 2.1.1 * check xgboost 2.1.1 * check xgboost 2.1.1 * check xgboost 2.1.1 * check xgboost 2.1.1 * check xgboost 2.1.1 * check xgboost 2.1.1 * check xgboost 2.1.1 * check xgboost 2.1.0 * check xgboost 2.1.0 * check xgboost 2.1.0 * check xgboost 2.1.0 * check xgboost 2.1.0 * check xgboost 2.1.0 * check xgboost 2.1.0 * check xgboost 2.1.0 * check xgboost 2.1.0 * check xgboost 2.1.0 * check xgboost 2.1.0 * check xgboost 2.1.0 * check xgboost 2.1.0 * check xgboost 2.1.0 * check xgboost 2.1.0 * check xgboost 2.1.0 * check xgboost 2.1.0 * check xgboost 2.1.0 * check xgboost 2.1.0 * check xgboost 2.1.0 * check xgboost 2.1.0 * check xgboost 2.1.0 * check xgboost 2.1.0 * check xgboost 2.1.0 * test xgboost 3.0.5 * test xgboost 3.0.5 * test xgboost 3.0.5 * test xgboost 3.0.5 * test xgboost 3.0.5 * test xgboost 3.0.5 * test xgboost 3.0.5 * test xgboost 3.0.5 * test xgboost 3.0.5 * test xgboost 3.0.5 * test xgboost 3.0.5 * test xgboost 3.0.5 * test xgboost 3.0.5 * test xgboost 3.0.5 * test xgboost 3.0.5 * test xgboost 3.0.5 * rename 2.1.0 with 3.0.5 * test 3.0.5 * test 3.0.5 * cuda 12.0.0 * roll back to cuda 11.6.1 * fix xgboost dask * fix xgboost train * fix import * change dask imports * revert changes of xgboost import * upgrade conda * modify checksum * modify checksum with 3.10 version * fix link * fix link * fix pip install * add main * add main * add ls grpc * add ls grpc * fix xgb import * remove cd * comment npm * comment npm * change arrow version * add github node * mpdify release tag * npm install * npm 
install from repo root * change cuda version * change cuda version * change cuda version * revert back * add git tag * change conda version * fix checksum * increase sm debug v * install node modules in libgrpc dir * change dask v * modify dask cuda * modify pynvml * add dask 25.10 * decrease dask v * fix grpc * fix grpc * fix grpc * fix commands * fix commands * fix commands * fix commands * downgrade smdebug * downgrade smdebug * change protobuf to 3.29.5 * change protobuf to 5.29 * change protobuf to 5.29 * bump py version * bump py version * bump py version * bump py version to 3.10 * bump py version to 3.10 * proto 3.20 * proto 3.20 * proto 3.20 * proto 3.20 * increase smdebug v * increase smdebug v * increase smdebug v * increase smdebug v * comment smdebu * comment smdebu * disble flake8 * comment smdebu * add get callbacks * add get callbacks * add cuda-python and update ubuntu version * cuda python version * change back to 1804 * downgrade dask version to address dask_expr error * add dask dataframe * add dataframe * change to cuda-python 12.6 * fix typo * install nccl lib * set LD_LIBRARY_PATH to find nccl * pin pip to 25.2 to avoid get setuptools from PyPi * modify system pip version pin * rename folder * update SAGEMAKER_XGBOOST_VERSION to 3.0-5 --------- Co-authored-by: Li Ning <lninga@amazon.com> Co-authored-by: lxning <23464292+lxning@users.noreply.github.com> Co-authored-by: jkottu <jkottu@amazon.com> Co-authored-by: Yadan Wei <yadanwei@amazon.com>
1 parent ec28c4b commit d4a0e90

33 files changed, with 931 additions and 306 deletions

docker/3.0-5/base/Dockerfile.cpu

Lines changed: 211 additions & 0 deletions
@@ -0,0 +1,211 @@
+ARG UBUNTU_VERSION=20.04
+ARG CUDA_VERSION=12.6.3
+ARG IMAGE_DIGEST=c2d95c9c6ff77da41cf0f2f9e8c5088f5b4db20c16a7566b808762f05b9032ef
+
+# Build stage for SQLite compilation
+FROM ubuntu:${UBUNTU_VERSION} as sqlite-builder
+RUN apt-get update && apt-get install -y --no-install-recommends \
+        build-essential \
+        wget \
+        ca-certificates \
+    && \
+    cd /tmp && \
+    wget https://www.sqlite.org/2025/sqlite-autoconf-3500200.tar.gz && \
+    tar xzf sqlite-autoconf-3500200.tar.gz && \
+    cd sqlite-autoconf-3500200 && \
+    ./configure --prefix=/usr/local && \
+    make && \
+    make install && \
+    ldconfig && \
+    cd / && \
+    rm -rf /tmp/sqlite-autoconf-3500200 /tmp/sqlite-autoconf-3500200.tar.gz && \
+    apt-get clean && \
+    rm -rf /var/lib/apt/lists/*
+
+# Main image
+FROM nvidia/cuda:${CUDA_VERSION}-base-ubuntu${UBUNTU_VERSION}
+
+ARG MINICONDA_VERSION=25.9.1
+ARG CONDA_CHECKSUM=04a8b03d8b0ec062d923e592201a6fd88b7247c309ef8848afb25c424c40ac39
+ARG CONDA_PY_VERSION=310
+ARG CONDA_PKG_VERSION=25.9.1
+ARG PYTHON_VERSION=3.10
+ARG PYARROW_VERSION=22.0.0
+ARG MLIO_VERSION=0.9.0
+ARG XGBOOST_VERSION=3.0.5
+
+ENV DEBIAN_FRONTEND=noninteractive
+ENV LANG=C.UTF-8
+ENV LC_ALL=C.UTF-8
+
+# Python won’t try to write .pyc or .pyo files on the import of source modules
+# Force stdin, stdout and stderr to be totally unbuffered. Good for logging
+ENV PYTHONDONTWRITEBYTECODE=1
+ENV PYTHONUNBUFFERED=1
+ENV PYTHONIOENCODING='utf-8'
+
+RUN apt-key del 7fa2af80 && \
+    apt-get update && apt-get install -y --no-install-recommends wget && \
+    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-keyring_1.0-1_all.deb && \
+    dpkg -i cuda-keyring_1.0-1_all.deb && \
+    apt-get update && \
+    apt-get -y upgrade && \
+    apt-get -y install --no-install-recommends \
+        build-essential \
+        curl \
+        git \
+        jq \
+        libatlas-base-dev \
+        expat \
+        nginx \
+        openjdk-8-jdk-headless \
+        unzip \
+        wget \
+        apparmor \
+        linux-libc-dev \
+        libxml2 \
+        libgstreamer1.0-0 \
+        linux-libc-dev \
+    && \
+    # MLIO build dependencies
+    # Official Ubuntu APT repositories do not contain an up-to-date version of CMake required to build MLIO.
+    # Kitware contains the latest version of CMake.
+    wget http://es.archive.ubuntu.com/ubuntu/pool/main/libf/libffi/libffi7_3.3-4_amd64.deb && \
+    dpkg -i libffi7_3.3-4_amd64.deb && \
+    apt-get -y install --no-install-recommends \
+        apt-transport-https \
+        ca-certificates \
+        gnupg \
+        software-properties-common \
+    && \
+    wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | \
+    gpg --dearmor - | \
+    tee /usr/share/keyrings/kitware-archive-keyring.gpg >/dev/null && \
+    echo 'deb [signed-by=/usr/share/keyrings/kitware-archive-keyring.gpg] https://apt.kitware.com/ubuntu/ bionic main' | tee /etc/apt/sources.list.d/kitware.list >/dev/null && \
+    apt-get update && \
+    rm /usr/share/keyrings/kitware-archive-keyring.gpg && \
+    apt-get install -y --no-install-recommends \
+        autoconf \
+        automake \
+        build-essential \
+        cmake \
+        cmake-data \
+        doxygen \
+        kitware-archive-keyring \
+        libcurl4-openssl-dev \
+        libssl-dev \
+        libtool \
+        ninja-build \
+        python3-dev \
+        python3-distutils \
+        python3-pip \
+        zlib1g-dev \
+        libxml2 \
+        zstd \
+        libsqlite3-0 \
+    && \
+    # system pip, highest version is 25.0.1
+    python3 -m pip install --upgrade "pip<=25.2" && \
+    python3 -m pip install --upgrade certifi && \
+    apt-get clean && \
+    # Node.js setup
+    mkdir -p /etc/apt/keyrings && \
+    curl -fsSL https://deb.nodesource.com/gpgkey/nodesource-repo.gpg.key | \
+    gpg --dearmor -o /etc/apt/keyrings/nodesource.gpg && \
+    echo "deb [signed-by=/etc/apt/keyrings/nodesource.gpg] https://deb.nodesource.com/node_20.x nodistro main" | \
+    tee /etc/apt/sources.list.d/nodesource.list && \
+    apt-get update && \
+    apt-get install -y nodejs && \
+    npm install -g npm@latest && \
+    rm -rf /var/lib/apt/lists/*
+
+# Install conda
+RUN cd /tmp && \
+    curl -L --output /tmp/Miniconda3.sh https://repo.anaconda.com/miniconda/Miniconda3-py${CONDA_PY_VERSION}_${MINICONDA_VERSION}-1-Linux-x86_64.sh && \
+    echo "${CONDA_CHECKSUM} /tmp/Miniconda3.sh" | sha256sum -c - && \
+    bash /tmp/Miniconda3.sh -bfp /miniconda3 && \
+    rm /tmp/Miniconda3.sh
+
+ENV PATH=/miniconda3/bin:${PATH}
+# Install MLIO with Apache Arrow integration
+RUN conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
+RUN conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r
+# We could install mlio-py from conda, but it comes with extra support such as image reader that increases image size
+# which increases training time. We build from source to minimize the image size.
+RUN conda config --system --set channel_priority strict && \
+    conda config --system --set auto_update_conda false && \
+    conda config --system --set show_channel_urls true && \
+    conda install -y -c conda-forge \
+        python=${PYTHON_VERSION} \
+        requests=2.32.3 \
+        conda=${CONDA_PKG_VERSION} \
+        pyarrow=${PYARROW_VERSION} \
+        --solver=libmamba && \
+    conda clean -afy
+
+# Then handle the grpc and npm parts separately
+RUN git clone -b v1.65.4 https://github.com/grpc/grpc.git && \
+    LIBGRPC_DIR=$(find /miniconda3/pkgs -name "libgrpc-*" -type d | head -n 1) && \
+    mkdir -p ${LIBGRPC_DIR}/info/test/examples && \
+    cp -r grpc/examples/* ${LIBGRPC_DIR}/info/test/examples/
+
+RUN cd ${LIBGRPC_DIR}/info/test/examples/node && \
+    npm cache clean --force && \
+    npm install minimist@latest protobufjs@latest \
+    apt-get purge -y nodejs npm && \
+    apt-get autoremove -y && \
+    rm -rf /etc/apt/sources.list.d/nodesource.list \
+        /etc/apt/keyrings/nodesource.gpg \
+        /etc/apt/sources.list.d/kitware.list && \
+    apt-get clean && \
+    rm -rf /var/lib/apt/lists/* && \
+    # Continue with the rest of the build process
+    cd /tmp && \
+    git clone --branch v${MLIO_VERSION} https://github.com/awslabs/ml-io.git mlio && \
+    cd mlio && \
+    sed -i 's/find_package(Arrow 14.0.1 REQUIRED/find_package(Arrow 22.0.0 REQUIRED/g' CMakeLists.txt && \
+    sed -i 's/pyarrow==14.0.1/pyarrow==22.0.0/g' src/mlio-py/setup.py && \
+    build-tools/build-dependency build/third-party all && \
+    mkdir -p build/release && \
+    cd build/release && \
+    cmake -GNinja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_PREFIX_PATH="$(pwd)/../third-party" ../.. && \
+    cmake --build . && \
+    cmake --build . --target install && \
+    cmake -DMLIO_INCLUDE_PYTHON_EXTENSION=ON -DPYTHON_EXECUTABLE="/miniconda3/bin/python3" \
+        -DMLIO_INCLUDE_ARROW_INTEGRATION=ON ../.. && \
+    cmake --build . --target mlio-py && \
+    cmake --build . --target mlio-arrow && \
+    cd ../../src/mlio-py && \
+    python3 setup.py bdist_wheel && \
+    python3 -m pip install typing && \
+    # pip will attempt to install setuptools from PyPi from pip>=25.3
+    python3 -m pip install --upgrade pip==25.2 && \
+    python3 -m pip install dist/*.whl && \
+    cp -r /tmp/mlio/build/third-party/lib/libtbb* /usr/local/lib/ && \
+    ldconfig && \
+    rm -rf /tmp/mlio
+
+# Copy compiled SQLite from builder stage
+COPY --from=sqlite-builder /usr/local/bin/sqlite3 /usr/local/bin/sqlite3
+COPY --from=sqlite-builder /usr/local/lib/libsqlite3.* /usr/local/lib/
+COPY --from=sqlite-builder /usr/local/include/sqlite3*.h /usr/local/include/
+
+# Update library cache and ensure /usr/local/bin is in PATH
+RUN ldconfig && \
+    echo "/usr/local/lib" > /etc/ld.so.conf.d/sqlite3.conf && \
+    ldconfig
+
+ENV PATH="/usr/local/bin:${PATH}"
+
+RUN echo "sqlite3 "
+# This command will check the version and print it to the build logs
+RUN sqlite3 --version
+
+RUN apt list --installed
+
+# Install the pinned version of XGBoost
+RUN python3 -m pip install --no-cache -I xgboost==${XGBOOST_VERSION} numpy==2.1.0 pyarrow==22.0.0 pandas==2.2.3
+
+# Starting from XGBoost 2.1.0, the PyPI package no longer bundles NCCL
+# Set LD_LIBRARY_PATH so xgboost can find libnccl
+ENV LD_LIBRARY_PATH=/miniconda3/lib/python3.10/site-packages/nvidia/nccl/lib:$LD_LIBRARY_PATH
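
In the base Dockerfile above, `LIBGRPC_DIR` is assigned inside one `RUN` instruction and referenced again in the following `RUN`. Each shell-form `RUN` starts a fresh shell, so a plain shell assignment does not carry over. The sketch below imitates that with two separate `sh -c` calls; the directory `/tmp/example-libgrpc` is a made-up placeholder, not a path from the image.

```shell
#!/bin/sh
# Sketch: each `sh -c` call stands in for one Dockerfile RUN instruction.
# /tmp/example-libgrpc is a hypothetical placeholder directory name.
unset LIBGRPC_DIR

# First "RUN": the variable is assigned and used inside the same shell.
first=$(sh -c 'LIBGRPC_DIR=/tmp/example-libgrpc; echo "${LIBGRPC_DIR}/info/test/examples"')

# Second "RUN": a fresh shell, so the earlier assignment is not visible.
second=$(sh -c 'echo "${LIBGRPC_DIR}/info/test/examples/node"')

echo "same RUN: ${first}"
echo "next RUN: ${second}"
```

In the second call the variable expands to an empty string, so the path begins at `/`. Keeping a value across instructions would require `ARG` or `ENV`, or doing both steps in a single `RUN`.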

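As written in the base Dockerfile, the `npm install minimist@latest protobufjs@latest` line ends with a bare `\` and no `&&`, so the shell joins the words of the next line onto the same command. A minimal sketch of that behavior follows; `count_args` is a hypothetical stand-in for `npm install` that only reports how many arguments it received.

```shell
#!/bin/sh
# Sketch: count_args is a hypothetical stand-in for `npm install`.
count_args() { echo "$#"; }

# Continuation without `&&`: the second line's words join the first command.
joined=$(count_args minimist@latest protobufjs@latest \
    apt-get purge -y nodejs npm)

# Continuation with `&&`: the first command ends before the second line starts.
separate=$(count_args minimist@latest protobufjs@latest && \
    true)

echo "without &&: ${joined} arguments"
echo "with &&:    ${separate} arguments"
```

The first call receives seven arguments and the second receives two, which is the difference between one combined command and two chained ones.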
docker/3.0-5/final/Dockerfile.cpu

Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
+ARG SAGEMAKER_XGBOOST_VERSION=3.0-5
+ARG PYTHON_VERSION=3.10
+
+FROM xgboost-container-base:${SAGEMAKER_XGBOOST_VERSION}-cpu-py3
+
+ARG SAGEMAKER_XGBOOST_VERSION=3.0-5
+
+########################
+# Install dependencies #
+########################
+
+# Fix Python 3.10 compatibility for sagemaker-containers
+# RUN python3 -c "import sys; sys.path.insert(0, '/miniconda3/lib/python3.10/site-packages'); \
+# import sagemaker_containers._mapping as m; \
+# import collections.abc; \
+# setattr(collections, 'Mapping', collections.abc.Mapping); \
+# exec(open('/miniconda3/lib/python3.10/site-packages/sagemaker_containers/_mapping.py').read().replace('collections.Mapping', 'collections.abc.Mapping'))" || \
+# sed -i 's/collections\.Mapping/collections.abc.Mapping/g' /miniconda3/lib/python3.10/site-packages/sagemaker_containers/_mapping.py
+
+
+# Install smdebug from source
+RUN python3 -m pip install git+https://github.com/awslabs/sagemaker-debugger.git@v1.0.32
+
+COPY requirements.txt /requirements.txt
+RUN python3 -m pip install -r /requirements.txt && rm /requirements.txt
+
+RUN pip install --no-cache-dir "protobuf>=3.20.0,<=3.20.3"
+# ENV PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
+
+RUN sed -i 's/collections\.Mapping/collections.abc.Mapping/g' /miniconda3/lib/python3.10/site-packages/sagemaker_containers/_mapping.py
+
+###########################
+# Copy wheel to container #
+###########################
+COPY dist/sagemaker_xgboost_container-2.0-py2.py3-none-any.whl /sagemaker_xgboost_container-1.0-py2.py3-none-any.whl
+RUN rm -rf /miniconda3/lib/python${PYTHON_VERSION}/site-packages/numpy-1.21.2.dist-info && \
+    python3 -m pip install --force-reinstall PyYAML==6.0.1 && \
+    python3 -m pip install --no-cache --no-deps /sagemaker_xgboost_container-1.0-py2.py3-none-any.whl && \
+    python3 -m pip uninstall -y typing && \
+    rm /sagemaker_xgboost_container-1.0-py2.py3-none-any.whl
+
+##############
+# DMLC PATCH #
+##############
+# TODO: remove after making contributions back to xgboost for tracker.py
+# COPY src/sagemaker_xgboost_container/dmlc_patch/tracker.py \
+# /miniconda3/lib/python${PYTHON_VERSION}/site-packages/xgboost/dmlc-core/tracker/dmlc_tracker/tracker.py
+
+# # Include DMLC python code in PYTHONPATH to use RabitTracker
+# ENV PYTHONPATH=$PYTHONPATH:/miniconda3/lib/python${PYTHON_VERSION}/site-packages/xgboost/dmlc-core/tracker
+
+#######
+# MMS #
+#######
+# Create MMS user directory
+RUN useradd -m model-server
+RUN mkdir -p /home/model-server/tmp && chown -R model-server /home/model-server
+
+# Copy MMS configs
+COPY docker/${SAGEMAKER_XGBOOST_VERSION}/resources/mms/config.properties.tmp /home/model-server
+ENV XGBOOST_MMS_CONFIG=/home/model-server/config.properties
+
+# Copy execution parameters endpoint plugin for MMS
+RUN mkdir -p /tmp/plugins
+COPY docker/${SAGEMAKER_XGBOOST_VERSION}/resources/mms/endpoints-1.0.jar /tmp/plugins
+RUN chmod +x /tmp/plugins/endpoints-1.0.jar
+
+# Create directory for models
+RUN mkdir -p /opt/ml/models
+RUN chmod +rwx /opt/ml/models
+
+# Copy Dask configs
+RUN mkdir /etc/dask
+COPY docker/configs/dask_configs.yaml /etc/dask/
+
+# Required label for multi-model loading
+LABEL com.amazonaws.sagemaker.capabilities.multi-models=true
+
+#####################
+# Required ENV vars #
+#####################
+# Set SageMaker training environment variables
+ENV SM_INPUT /opt/ml/input
+ENV SM_INPUT_TRAINING_CONFIG_FILE $SM_INPUT/config/hyperparameters.json
+ENV SM_INPUT_DATA_CONFIG_FILE $SM_INPUT/config/inputdataconfig.json
+ENV SM_CHECKPOINT_CONFIG_FILE $SM_INPUT/config/checkpointconfig.json
+# See: https://github.com/dmlc/xgboost/issues/7982#issuecomment-1379390906 https://github.com/dmlc/xgboost/pull/8257
+ENV NCCL_SOCKET_IFNAME eth
+
+
+# Set SageMaker serving environment variables
+ENV SM_MODEL_DIR /opt/ml/model
+
+# Set SageMaker entrypoints
+ENV SAGEMAKER_TRAINING_MODULE sagemaker_xgboost_container.training:main
+ENV SAGEMAKER_SERVING_MODULE sagemaker_xgboost_container.serving:main
+
+EXPOSE 8080
+ENV TEMP=/home/model-server/tmp
+LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true
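
The final Dockerfile patches `sagemaker_containers/_mapping.py` with `sed` because the `collections.Mapping` alias was removed in Python 3.10. The sketch below applies the same substitution to a scratch file rather than the installed module; the file content is invented purely for illustration.

```shell
#!/bin/sh
# Sketch: run the Dockerfile's sed expression against a scratch file.
# The sample line is made up; it is not the real _mapping.py source.
tmpfile=$(mktemp)
printf 'class MappingObject(collections.Mapping):\n    pass\n' > "${tmpfile}"

# Same expression as the RUN line: the escaped dot matches a literal ".".
patched=$(sed 's/collections\.Mapping/collections.abc.Mapping/g' "${tmpfile}")
rm -f "${tmpfile}"

echo "${patched}"
```

Because the replacement text no longer matches the pattern, rerunning the substitution on an already patched file leaves it unchanged.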

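The three config-file variables in the final Dockerfile are defined in terms of `$SM_INPUT`, which Docker substitutes from the earlier `ENV` line at build time. As a rough illustration, the same chain can be mirrored with plain shell variables to see the resolved paths; this is a sketch of the substitution only, not of how SageMaker reads the files.

```shell
#!/bin/sh
# Sketch: mirror the ENV lines with shell variables to show the resolved paths.
SM_INPUT=/opt/ml/input
SM_INPUT_TRAINING_CONFIG_FILE=$SM_INPUT/config/hyperparameters.json
SM_INPUT_DATA_CONFIG_FILE=$SM_INPUT/config/inputdataconfig.json
SM_CHECKPOINT_CONFIG_FILE=$SM_INPUT/config/checkpointconfig.json

echo "${SM_INPUT_TRAINING_CONFIG_FILE}"
echo "${SM_INPUT_DATA_CONFIG_FILE}"
echo "${SM_CHECKPOINT_CONFIG_FILE}"
```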