s3_management: add a mechanism to manage third-party dependencies cache on S3 by izaitsevfb · Pull Request #1096 · pytorch/builder

izaitsevfb · 2022-08-04T19:45:10Z

Partially addresses: pytorch/pytorch#75703

This PR is currently a work in progress, see below for the list of TODOS.

Problem

As described in pytorch/pytorch#75703, the build currently depends in multiple places on third-party dependencies hosted on third-party servers, which have proven unreliable.

Some of these places are just basic wget and curl requests to third-party URLs, which can (and should) be cached in our public S3 buckets to provide a fallback for reliability.

The issue with manually uploading such dependencies is two-fold: first, it's time-consuming; second, external engineers don't have access to the ossci S3 buckets.

The issue with fully automated upload is the lack of transparency, control, and security.

Proposal

This PR proposes a compromise between manual and fully automated caching of third-party dependencies.

The script (see s3_management/thirdparty_deps/manage.py) takes a yml configuration with a list of URLs and their corresponding S3 keys:

- bucket: ossci-linux
  url: https://developer.download.nvidia.com/compute/redist/cudnn/v8.3.2/local_installers/11.5/cudnn-linux-x86_64-8.3.2.44_cuda11.5-archive.tar.xz
  key: cudnn/cudnn-linux-x86_64-8.3.2.44_cuda11.5-archive.tar.xz 

- bucket: ossci-linux
  url: https://developer.download.nvidia.com/...
  key: cuda/...

...

...and uploads the files behind any URLs that haven't been synced yet to S3, making them public.

The script is intended to be invoked by CI (that has access to ossci S3 secrets).
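The "upload only what hasn't been synced yet" behaviour can be sketched roughly as follows. This is a minimal illustration, not the code in manage.py: the function names `plan_sync` and `sync` are hypothetical, the entries are assumed to be already parsed from the yml config into dicts, and the S3 operations are passed in as callables so the logic can be exercised without credentials.

```python
from typing import Callable, Dict, List

# One config entry, mirroring the yml fields: "bucket", "url", "key".
Entry = Dict[str, str]


def plan_sync(entries: List[Entry], exists: Callable[[str, str], bool]) -> List[Entry]:
    """Return the entries whose (bucket, key) is not present on S3 yet.

    `exists(bucket, key)` is a hypothetical callable; in a real script it
    would query S3 for the object.
    """
    return [e for e in entries if not exists(e["bucket"], e["key"])]


def sync(
    entries: List[Entry],
    exists: Callable[[str, str], bool],
    upload: Callable[[str, str, str], None],
) -> int:
    """Upload every missing entry and return the number of uploads.

    `upload(url, bucket, key)` is a hypothetical callable that would
    download the URL and store it publicly under the given key.
    """
    pending = plan_sync(entries, exists)
    for entry in pending:
        upload(entry["url"], entry["bucket"], entry["key"])
    return len(pending)
```

Because existing keys are skipped, running the sync twice with the same config is a no-op the second time. It also means that changing the URL behind an existing key has no effect, which matches the limitation noted under "Potential issues/questions".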

Intended workflow

The yml config will be committed to one (or perhaps more) of pytorch org repos
- the corresponding CI workflows will be added as well
Sync script is invoked by CI when config is pushed into master branch, or manually
- This way PyTorch Dev Infra team can validate and approve the changes in the S3 cache
- Adding new URLs to the S3 cache and using them elsewhere could be done within the same PR (may require manual triggering of the sync workflow)

Validation

The script contains rudimentary automatic validation (currently hardcoded) of the yml config:

white list of bucket names
list of allowed key prefixes for each bucket
regex s3 key validation
url validation

But the workflow mostly relies on manual review of the PRs containing the S3 cache update.
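The four checks listed above could look roughly like the sketch below. Everything concrete here is an assumption made for illustration: the allowed buckets and prefixes, the key regex, the rejection of `..` in keys, and the https-only rule are not taken from manage.py, which hardcodes its own values.

```python
import re
from typing import Dict, List
from urllib.parse import urlparse

# Hypothetical rules: bucket allowlist with the key prefixes allowed in each.
ALLOWED_PREFIXES: Dict[str, List[str]] = {
    "ossci-linux": ["cudnn/", "cuda/"],
}

# Hypothetical key pattern: letters, digits, dot, underscore, slash, hyphen.
KEY_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9._/-]*")


def validate_entry(entry: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors: List[str] = []
    bucket = entry.get("bucket", "")
    key = entry.get("key", "")
    url = entry.get("url", "")

    # 1. Bucket allowlist, 2. allowed key prefixes for that bucket.
    if bucket not in ALLOWED_PREFIXES:
        errors.append(f"bucket not allowed: {bucket!r}")
    elif not any(key.startswith(p) for p in ALLOWED_PREFIXES[bucket]):
        errors.append(f"key prefix not allowed in {bucket!r}: {key!r}")

    # 3. Regex check of the key, plus a guard against parent-directory segments.
    if not KEY_RE.fullmatch(key) or ".." in key:
        errors.append(f"malformed key: {key!r}")

    # 4. URL check: assumed here to mean "absolute https URL with a host".
    parsed = urlparse(url)
    if parsed.scheme != "https" or not parsed.netloc:
        errors.append(f"malformed or non-https url: {url!r}")

    return errors
```

Returning a list of problems rather than raising on the first one lets a CI job report every issue in a config in a single run, which helps given that the workflow relies mostly on manual PR review.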

Advantages

Self-documenting organization of the ossci S3 buckets (at least the third-party cache part)
- keeping the yml config in git makes it possible to trace who uploaded a file to S3 and why
Sharing the ability to upload to S3 with external engineers in a controlled way
Uploading to S3 is automated, which saves time

TODO

Add CI jobs that invoke the script
- discuss which projects should include the yml config and the CI job (builder, pytorch?)
Extract the validation config into the yml file
Add documentation and reference from other parts of the documentation that require S3 upload
Add more validation rules (?)

Potential issues/questions

potential race condition when sync jobs with two versions of the conf with the same key are run in parallel
- could happen:
  - when one sync job is triggered manually
  - when two PRs that update the same key are merged in rapid succession
- the outcome is undefined: either version of the file may end up on S3
updating/removing dependencies is not supported automatically
- currently manual deletion from S3 is required

izaitsevfb · 2022-08-04T21:04:59Z

The next step is to add the CI job(s) that will run the script to perform the sync according to the config.

It's not clear to me which project is the best place for the workflow and the sync config.

The options are:

pytorch/builder project
pytorch/pytorch project
both projects: have a separate config per project

Since there are cached dependencies in both projects, it might be reasonable to have a separate config in each. On the other hand, a config that lives in one place is easier to maintain.

Note: pytorch/pytorch already contains another s3-related workflow update_s3_htmls.yml.

@janeyx99 @huydhn @atalman, since you already know the context, could you please share your thoughts on that?

huydhn · 2022-08-06T20:52:31Z

potential race condition when sync jobs with two versions of the conf with the same key are run in parallel

I think we could avoid this by running the script as a periodic job instead of on pull_request or push. We probably don't need sub-second sync here, do we? It also keeps things simple :)

pytorch/builder

My understanding is that this repo is dedicated to building and releasing pytorch, so having the CI job here feels a bit out of place.

pytorch/pytorch

We could have it here, but I'd rather not, because merging a PR into this repo takes a long time (without test target determination and other cool things we are planning to build ;)). IMO, we don't really need such extensive testing for this change.

Thus the best place I can think of is https://github.com/pytorch/test-infra. As its name indicates, it is a collection of infrastructure components that are supporting the PyTorch CI/CD system. So we can have 2 parts like:

Sync job running on https://github.com/pytorch/test-infra, keeping everything neat and up-to-date. It runs behind the scenes
Update the script on pytorch/pytorch to get the deps from S3
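On the consuming side, "get the deps from S3" could be combined with the original URL so that either source serves as a fallback for the other. The sketch below is only an illustration under stated assumptions: `s3_public_url`, `http_get`, and `fetch_first` are hypothetical helpers, and the public objects are assumed to be reachable through the standard virtual-hosted-style S3 URL.

```python
import urllib.request
from typing import Callable, Iterable, Optional


def s3_public_url(bucket: str, key: str) -> str:
    """Build the virtual-hosted-style URL of a public S3 object (assumed layout)."""
    return f"https://{bucket}.s3.amazonaws.com/{key}"


def http_get(url: str, timeout: float = 30.0) -> bytes:
    """Download a URL and return its body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_first(urls: Iterable[str], get: Callable[[str], bytes] = http_get) -> bytes:
    """Try each URL in order and return the first successful download."""
    last_error: Optional[BaseException] = None
    for url in urls:
        try:
            return get(url)
        except OSError as err:  # urllib's URLError and HTTPError derive from OSError
            last_error = err
    raise RuntimeError("all download sources failed") from last_error
```

The caller decides the order: listing the S3 URL first treats the cache as the primary source, while listing the upstream URL first keeps S3 as a pure fallback.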

huydhn · 2022-08-08T02:46:06Z

FYI, I have another prime candidate for this script besides the CUDA stuff: https://github.com/pytorch/pytorch/blob/master/.jenkins/pytorch/win-test-helpers/installation-helpers/install_miniconda3.bat#L12. This is definitely going to be useful :)

izaitsevfb · 2022-08-11T22:43:49Z

After the discussion with @seemethere, I decided to stick with the dockerised GH action approach, as it allows a granular config per project and is more reliable (once the image is built, it is guaranteed to work).

The system will consist of two parts:

in pytorch/builder there will be a docker image of the GH action with the sync script (this PR). The image must be built and pushed to docker repo once and it won't change often (if ever).
in other projects there will be a configuration + workflow to apply the configuration
- See an example of the new workflow and configuration in this PR: Add a workflow to cache third party dependencies on S3 pytorch#83306
- Similar workflows and configs could be added to other projects (beside pytorch/pytorch) as needed to cache files on S3 that are specific to these projects.

I believe this PR is ready for review.

Testing:

built docker image locally
checked different configs against test S3 bucket locally (using docker run)
- validation
- successful upload and ACLs (ensured that test bucket has the same ACL settings as OSSCI buckets)
- failures
used act to verify the workflow locally

…he on S3 (pytorch/pytorch#75703)

For the context, see #75703, pytorch/builder#1096.

Note: depends on the docker image `pytorch/sync_s3_thirdparty_deps` from pytorch/builder#1096

Summary of additions:
* workflow config (based on pytorch/sync_s3_thirdparty_deps GH action)
* S3 mapping config (sync_s3_cache.yml)

Pull Request resolved: #83306
Approved by: https://github.com/huydhn

facebook-github-bot added the cla signed label Aug 4, 2022

izaitsevfb self-assigned this Aug 4, 2022

izaitsevfb force-pushed the s3-cache-thirdparty-deps branch 2 times, most recently from b5728f6 to db18450 Compare August 11, 2022 03:46

izaitsevfb mentioned this pull request Aug 11, 2022

Add a workflow to cache third party dependencies on S3 pytorch/pytorch#83306

Closed

izaitsevfb changed the title ~~[WIP] s3_management: add a mechanism to manage third-party dependencies cache on S3~~ s3_management: add a mechanism to manage third-party dependencies cache on S3 Aug 11, 2022

huydhn reviewed Aug 12, 2022


Comment thread s3_management/thirdparty_deps/manage.py

huydhn reviewed Aug 12, 2022


Comment thread s3_management/thirdparty_deps/manage.py

huydhn reviewed Aug 12, 2022


Comment thread s3_management/thirdparty_deps/manage.py Outdated

huydhn reviewed Aug 12, 2022


Comment thread s3_management/thirdparty_deps/Dockerfile Outdated

s3_management: add a mechanism to manage third-party dependencies cac…

a35e46c

…he on S3 (pytorch/pytorch#75703)

izaitsevfb force-pushed the s3-cache-thirdparty-deps branch from db18450 to a35e46c Compare August 12, 2022 02:47

izaitsevfb mentioned this pull request Aug 12, 2022

S3_management: add the ability to update existing files on S3 #1103

Open

huydhn approved these changes Aug 12, 2022



s3_management: add a mechanism to manage third-party dependencies cache on S3 #1096
izaitsevfb wants to merge 1 commit into pytorch:main from izaitsevfb:s3-cache-thirdparty-deps
