Remove AWS_OFI_NCCL_VERSION #911
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -38,7 +38,6 @@ The NCCL tests are packaged in a container. | |
| > |`CUDA_VERSION` | `12.8.1` | | | ||
| > |`GDRCOPY_VERSION` | `v2.5.1` | [link](https://github.com/NVIDIA/gdrcopy) | | ||
| > |`EFA_INSTALLER_VERSION`| `1.43.2` | [link](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html#efa-start-enable) | | ||
| > |`AWS_OFI_NCCL_VERSION` | `v1.16.3` | [link](https://github.com/aws/aws-ofi-nccl) | | ||
| > |`NCCL_VERSION` | `v2.27.7-1` | [link](https://github.com/NVIDIA/nccl) | | ||
| > |`NCCL_TESTS_VERSION` | `v2.16.9` | [link](https://github.com/NVIDIA/nccl-tests) | | ||
|
|
||
|
|
@@ -47,10 +46,9 @@ You must pick each version of the library and set them as variables before proce | |
| ```bash | ||
| GDRCOPY_VERSION=v2.5.1 | ||
| EFA_INSTALLER_VERSION=1.43.2 | ||
| AWS_OFI_NCCL_VERSION=v1.16.3 | ||
| NCCL_VERSION=v2.27.7-1 | ||
| NCCL_TESTS_VERSION=v2.16.9 | ||
| TAG="efa${EFA_INSTALLER_VERSION}-ofi${AWS_OFI_NCCL_VERSION}-nccl${NCCL_VERSION}-tests${NCCL_TESTS_VERSION}" | ||
| TAG="efa${EFA_INSTALLER_VERSION}-nccl${NCCL_VERSION}-tests${NCCL_TESTS_VERSION}" | ||
| CONTAINER_IMAGE_NAME_TAG="nccl-tests:${TAG}" | ||
| ``` | ||
|
|
||
|
|
@@ -62,7 +60,6 @@ If you wish to build the containar image by yourself, follow this section. Alter | |
| ```bash | ||
| docker build -f nccl-tests.Dockerfile \ | ||
| --build-arg="EFA_INSTALLER_VERSION=${EFA_INSTALLER_VERSION}" \ | ||
| --build-arg="AWS_OFI_NCCL_VERSION=${AWS_OFI_NCCL_VERSION}" \ | ||
|
Contributor
Either comment this out, or add a comment below showing how people on older EFA versions can build with the OFI NCCL installation. This is just for the short term, until a couple more EFA installer versions are out.
||
| --build-arg="NCCL_VERSION=${NCCL_VERSION}" \ | ||
| --build-arg="NCCL_TESTS_VERSION=${NCCL_TESTS_VERSION}" \ | ||
| -t ${CONTAINER_IMAGE_NAME_TAG} \ | ||
|
|
@@ -262,7 +259,7 @@ To change the type of collective to test, modify the line with `srun` in the fil | |
| kubectl logs -f $(kubectl get pods | grep launcher | cut -d ' ' -f 1) | ||
| ``` | ||
|
|
||
| The following is an example exerpt from the logs of a NCCL all_reduce_perf test, executed on a cluster with two p5.48xlarge instances (using EFA_INSTALLER_VERSION=1.28.0, AWS_OFI_NCCL_VERSION=v1.7.3-aws, NCCL_TESTS_VERSION=master, ARG NCCL_VERSION=2.18.5): | ||
|
Contributor
Same here.
||
| The following is an example exerpt from the logs of a NCCL all_reduce_perf test, executed on a cluster with two p5.48xlarge instances (using EFA_INSTALLER_VERSION=1.28.0, NCCL_TESTS_VERSION=master, ARG NCCL_VERSION=2.18.5): | ||
|
|
||
| ```log | ||
| [1,0]<stdout>:# out-of-place in-place | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -5,7 +5,6 @@ FROM nvcr.io/nvidia/cuda:${CUDA_VERSION}-devel-ubuntu22.04 | |
|
|
||
| ARG GDRCOPY_VERSION=v2.5.1 | ||
| ARG EFA_INSTALLER_VERSION=1.43.2 | ||
| ARG AWS_OFI_NCCL_VERSION=v1.16.3 | ||
|
Contributor
Same here.
||
| ARG NCCL_VERSION=v2.27.7-1 | ||
| ARG NCCL_TESTS_VERSION=v2.16.9 | ||
|
|
||
|
|
||
Can you still add a line to the README that shows folks how they can install the OFI NCCL version, and explains why this was removed (it is now bundled with the EFA installation)?
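
One possible shape for the short-term note requested above, as a minimal sketch rather than the PR's actual change: the image tag keeps its `-ofi<version>` component only when `AWS_OFI_NCCL_VERSION` is set, so users on older EFA installer versions can keep the old tag format while everyone else gets the new one. The `make_tag` helper is hypothetical and not part of the repository. Passing `--build-arg="AWS_OFI_NCCL_VERSION=..."` to `docker build` would additionally require a Dockerfile revision that still declares that `ARG`, which this PR removes.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): builds the image tag and adds the
# "-ofi<version>" component only when AWS_OFI_NCCL_VERSION is set and non-empty.
make_tag() {
  local tag="efa${EFA_INSTALLER_VERSION}"
  if [ -n "${AWS_OFI_NCCL_VERSION:-}" ]; then
    # Older EFA installers: the OFI NCCL plugin version is pinned separately.
    tag="${tag}-ofi${AWS_OFI_NCCL_VERSION}"
  fi
  tag="${tag}-nccl${NCCL_VERSION}-tests${NCCL_TESTS_VERSION}"
  printf '%s\n' "${tag}"
}

# Versions taken from the README snippet in this diff.
EFA_INSTALLER_VERSION=1.43.2
NCCL_VERSION=v2.27.7-1
NCCL_TESTS_VERSION=v2.16.9

# Newer EFA installers bundle the plugin, so AWS_OFI_NCCL_VERSION stays unset.
TAG="$(make_tag)"
CONTAINER_IMAGE_NAME_TAG="nccl-tests:${TAG}"
echo "${CONTAINER_IMAGE_NAME_TAG}"
```

To reproduce the previous tag format, set `AWS_OFI_NCCL_VERSION=v1.16.3` before calling `make_tag`.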