This repository was archived by the owner on Aug 15, 2025. It is now read-only.

s3_management: add a mechanism to manage third-party dependencies cache on S3#1096

Open
izaitsevfb wants to merge 1 commit into pytorch:main from izaitsevfb:s3-cache-thirdparty-deps

Conversation

@izaitsevfb
Contributor

@izaitsevfb izaitsevfb commented Aug 4, 2022

Partially addresses: pytorch/pytorch#75703

This PR is currently a work in progress, see below for the list of TODOS.

Problem

As described in pytorch/pytorch#75703, there are currently multiple places where the build depends on third-party dependencies hosted on third-party servers (which have proven unreliable).

Some of these places are just basic wget and curl requests to third-party URLs, which can (and should) be cached in our public S3 buckets to provide a fallback for reliability.

The issue with uploading such dependencies manually is two-fold: first, it is time-consuming; second, external engineers don't have access to the ossci S3 buckets.

The issue with fully automated upload is the lack of transparency, control, and security.

Proposal

This PR proposes a compromise between manual and fully automated caching of third-party dependencies.

The script (see s3_management/thirdparty_deps/manage.py) takes a yml configuration with a list of URLs and their corresponding S3 keys:

- bucket: ossci-linux
  url: https://developer.download.nvidia.com/compute/redist/cudnn/v8.3.2/local_installers/11.5/cudnn-linux-x86_64-8.3.2.44_cuda11.5-archive.tar.xz
  key: cudnn/cudnn-linux-x86_64-8.3.2.44_cuda11.5-archive.tar.xz 

- bucket: ossci-linux
  url: https://developer.download.nvidia.com/...
  key: cuda/...

...

...and uploads the files behind any URLs that haven't been synced yet to S3, making them public.

The script is intended to be invoked by CI (that has access to ossci S3 secrets).
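
The "only upload what hasn't been synced yet" step can be sketched roughly as follows. This is a minimal illustration and not the code in manage.py: `CacheEntry`, `plan_uploads`, and the `existing` set of (bucket, key) pairs are hypothetical names, and the sketch assumes the set of keys already present in S3 has been collected by some other means.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class CacheEntry(NamedTuple):
    """One item of the yml config (hypothetical in-memory representation)."""
    bucket: str
    url: str
    key: str


def plan_uploads(
    entries: Iterable[CacheEntry],
    existing: Set[Tuple[str, str]],
) -> List[CacheEntry]:
    """Return the entries whose (bucket, key) pair is not in S3 yet.

    `existing` is assumed to hold the (bucket, key) pairs already uploaded;
    how that set is obtained is outside the scope of this sketch.
    """
    return [e for e in entries if (e.bucket, e.key) not in existing]
```

With such a function, re-running the sync against an unchanged config yields an empty plan, so the job is safe to trigger repeatedly.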

Intended workflow

  • The yml config will be committed to one (or perhaps more) of pytorch org repos
    • the corresponding CI workflows will be added as well
  • Sync script is invoked by CI when config is pushed into master branch, or manually
    • This way PyTorch Dev Infra team can validate and approve the changes in the S3 cache
    • Adding new URLs to the S3 cache and using them elsewhere could be done within the same PR (may require manual triggering of the sync workflow)

Validation

Script contains rudimentary auto validation (currently hardcoded) of the yml config:

  • white list of bucket names
  • list of allowed key prefixes for each bucket
  • regex s3 key validation
  • url validation

But the workflow mostly relies on manual review of the PRs containing the S3 cache update.
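
To make the four checks above concrete, here is a rough sketch of what such validation could look like. The allow-list contents, the key regex, the rejection of `..`, and the https-only rule are all assumptions chosen for illustration; the actual rules hardcoded in the script may differ.

```python
import re
from typing import List
from urllib.parse import urlparse

# Hypothetical allow-list: bucket name -> permitted key prefixes.
ALLOWED_PREFIXES = {"ossci-linux": ("cudnn/", "cuda/")}

# Hypothetical key pattern: letters, digits, and a few safe separators.
KEY_PATTERN = re.compile(r"[A-Za-z0-9][A-Za-z0-9_./-]*")


def validate_entry(bucket: str, url: str, key: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    prefixes = ALLOWED_PREFIXES.get(bucket)
    if prefixes is None:
        errors.append(f"bucket not in allow-list: {bucket}")
    elif not key.startswith(prefixes):
        errors.append(f"key prefix not allowed for {bucket}: {key}")
    # Reject keys with unexpected characters or parent-directory segments.
    if not KEY_PATTERN.fullmatch(key) or ".." in key:
        errors.append(f"malformed S3 key: {key}")
    parsed = urlparse(url)
    # Assumption: only https sources with a host are acceptable.
    if parsed.scheme != "https" or not parsed.netloc:
        errors.append(f"url must be https with a host: {url}")
    return errors
```

Returning all problems at once, rather than failing on the first, lets a reviewer see every issue with a config entry in a single CI run.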

Advantages

  • Self-documenting organization of the ossci S3 buckets (at least the third-party cache part)
    • yml config in git makes it possible to trace who uploaded a file to S3 and why
  • Sharing the ability to upload to S3 with external engineers in a controlled way
  • Uploading to s3 is automated -> saves time

TODO

  • Add CI jobs that invoke the script
    • discuss which projects should include the yml config and the CI job (builder, pytorch?)
  • Extract the validation config into the yml file
  • Add documentation and reference from other parts of the documentation that require S3 upload
  • Add more validation rules (?)

Potential issues/questions

  • potential race condition when sync jobs with two versions of the conf with the same key are run in parallel
    • could happen:
      • when one sync job is triggered manually
      • when two PRs that update the same key are merged in rapid succession
    • undefined outcome: either version of the file may end up on S3
  • updating/removing dependencies is not supported automatically
    • currently manual deletion from S3 is required
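
One way to surface the same-key problem before it reaches S3 is to compare the old and new versions of the config and flag any key that is remapped to a different URL. The sketch below is an illustration only: `find_conflicts` is a hypothetical helper that is not part of this PR, and entries are assumed to be plain dicts with `bucket`, `url`, and `key` fields as in the yml example.

```python
from typing import Dict, List, Tuple


def find_conflicts(
    old: List[Dict[str, str]],
    new: List[Dict[str, str]],
) -> List[Tuple[str, str]]:
    """Return (bucket, key) pairs that point to a different URL in `new`.

    Such entries would need the manual update/delete path described above,
    since the sync only uploads keys that are not in S3 yet.
    """
    old_urls = {(e["bucket"], e["key"]): e["url"] for e in old}
    conflicts = []
    for e in new:
        ident = (e["bucket"], e["key"])
        # Entries that are new, or unchanged, are not conflicts.
        if ident in old_urls and old_urls[ident] != e["url"]:
            conflicts.append(ident)
    return conflicts
```

A CI check built on this idea could fail the PR when the list is non-empty, which would turn the undefined outcome into an explicit review step.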

@izaitsevfb
Contributor Author

The next step is to add the CI job(s) that will run the script to perform the sync according to the config.

It's not clear to me which project is the best place for the workflow and the sync config.

The options are:

  • pytorch/builder project
  • pytorch/pytorch project
  • both projects: have a separate config per project

Since there are cached dependencies in both projects, it might be reasonable to have a separate config in each project; at the same time, a config that lives in one place is easier to maintain.

Note: pytorch/pytorch already contains another s3-related workflow update_s3_htmls.yml.

@janeyx99 @huydhn @atalman, since you already know the context, could you please share your thoughts on that?

@huydhn
Contributor

huydhn commented Aug 6, 2022

potential race condition when sync jobs with two versions of the conf with the same key are run in parallel

I think we could avoid this by not running the script on pull_request or push, but as a periodic job. We probably don't need sub-second sync here, do we? It also keeps things simple :)

pytorch/builder

My understanding is that this repo is dedicated to building and releasing pytorch. So I feel that having the CI job here is a bit out of place.

pytorch/pytorch

We could have it here, but I would rather not, because merging a PR into this repo takes a long time (without test target determination and other cool things we are planning to build ;)). IMO, we don't really need such extensive testing capability for this change.

Thus the best place I can think of is https://github.com/pytorch/test-infra. As its name indicates, it is a collection of infrastructure components that are supporting the PyTorch CI/CD system. So we can have 2 parts like:

  1. Sync job running on https://github.com/pytorch/test-infra, keeping everything neat and up-to-date. It runs behind the scenes
  2. Update the script on pytorch/pytorch to get the deps from S3
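
On the consuming side, a script that prefers one source and falls back to another could look roughly like the sketch below. It is an illustration under stated assumptions: `s3_public_url` and `fetch_first_available` are hypothetical helpers, the URL uses the virtual-hosted `<bucket>.s3.amazonaws.com` form on the assumption that the object is publicly readable, and the order of the candidate URLs (S3 first or third-party first) is left to the caller.

```python
import urllib.request
from typing import Callable, Iterable, Optional


def s3_public_url(bucket: str, key: str) -> str:
    """Build a virtual-hosted-style URL (assumes a publicly readable object)."""
    return f"https://{bucket}.s3.amazonaws.com/{key}"


def _http_get(url: str) -> bytes:
    """Default fetcher: plain HTTP GET with a timeout."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


def fetch_first_available(
    urls: Iterable[str],
    fetch: Optional[Callable[[str], bytes]] = None,
) -> bytes:
    """Try each URL in order and return the first successful download."""
    fetch = fetch or _http_get
    failures = []
    for url in urls:
        try:
            return fetch(url)
        except OSError as exc:  # urllib's URLError/HTTPError subclass OSError
            failures.append(f"{url}: {exc}")
    raise RuntimeError("all sources failed: " + "; ".join(failures))
```

The `fetch` parameter exists so the fallback logic can be exercised without network access, for example by passing a stub that raises `OSError` for the first URL.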

@huydhn
Contributor

huydhn commented Aug 8, 2022

FYI, I have another prime candidate for this script besides the CUDA stuff: https://github.com/pytorch/pytorch/blob/master/.jenkins/pytorch/win-test-helpers/installation-helpers/install_miniconda3.bat#L12. This is definitely going to be useful :)

@izaitsevfb
Contributor Author

izaitsevfb commented Aug 11, 2022

After the discussion with @seemethere, I decided to stick with the dockerised GH action approach, as it allows a granular config per project and is more reliable (once the image is built, it is guaranteed to work).

The system will consist of two parts:

  1. in pytorch/builder there will be a docker image of the GH action with the sync script (this PR). The image must be built and pushed to docker repo once and it won't change often (if ever).
  2. in other projects there will be a configuration + workflow to apply the configuration

I believe this PR is ready for review.


Testing:

  • built docker image locally
  • checked different configs against test S3 bucket locally (using docker run)
    • validation
    • successful upload and ACLs (ensured that test bucket has the same ACL settings as OSSCI buckets)
    • failures
  • used act to verify the workflow locally

@izaitsevfb changed the title from "[WIP] s3_management: add a mechanism to manage third-party dependencies cache on S3" to "s3_management: add a mechanism to manage third-party dependencies cache on S3" on Aug 11, 2022
Comment thread s3_management/thirdparty_deps/manage.py
Comment thread s3_management/thirdparty_deps/manage.py
Comment thread s3_management/thirdparty_deps/manage.py Outdated
Comment thread s3_management/thirdparty_deps/Dockerfile Outdated
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Aug 15, 2022
For the context, see #75703, pytorch/builder#1096.

Note: depends on the docker image `pytorch/sync_s3_thirdparty_deps` from pytorch/builder#1096

Summary of additions:
* workflow config (based on pytorch/sync_s3_thirdparty_deps GH action)
* S3 mapping config (sync_s3_cache.yml)

Pull Request resolved: #83306
Approved by: https://github.com/huydhn