s3_management: add a mechanism to manage third-party dependencies cache on S3 #1096
izaitsevfb wants to merge 1 commit into pytorch:main
Conversation
The next step is to add the CI job(s) that will run the script to perform the sync according to the config. And it's not clear to me what the best project to place the workflow and sync config in is. The options are:
Since there are cached dependencies in both projects, it might be reasonable to have separate configs in each project; at the same time, it's easier to maintain a config that is in one place. Note: @janeyx99 @huydhn @atalman, since you already know the context, could you please share your thoughts on that?
I think we could avoid this by not running the script at pull_request or push, but as a periodical job. We probably don't need sub-second sync here, do we? It also keeps things simple :)
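A periodic job like the one suggested above could be sketched roughly as follows. This is a hypothetical workflow fragment, not the one from this PR: the workflow name, cron cadence, script path, and secret names are all assumptions for illustration.

```yaml
name: sync-s3-thirdparty-deps
on:
  schedule:
    - cron: "0 */6 * * *"   # every 6 hours; sub-second sync is not needed
  workflow_dispatch:         # also allow manual runs
jobs:
  sync:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run sync script against the mapping config
        run: python s3_management/thirdparty_deps/manage.py sync_s3_cache.yml
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
```

Running on a schedule (plus `workflow_dispatch`) avoids coupling the sync to `pull_request`/`push` triggers entirely.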
My understanding is that this repo is dedicated to building and releasing pytorch. So I feel that having the CI job here is a bit out of place.
We could have it here, but I'd rather not do this because merging a PR into this repo takes a long time (without test target determination and other cool things we are planning to build ;)). IMO, we don't really need such extensive testing capability for this change. Thus the best place I can think of is https://github.com/pytorch/test-infra. As its name indicates, it is a collection of infrastructure components that support the PyTorch CI/CD system. So we can have 2 parts like:
FYI, I have another prime candidate for this script besides CUDA stuff: https://github.com/pytorch/pytorch/blob/master/.jenkins/pytorch/win-test-helpers/installation-helpers/install_miniconda3.bat#L12. This is definitely going to be useful :)
(force-pushed from b5728f6 to db18450)
After the discussion with @seemethere, I decided to stick with the dockerised GH action approach, as it allows having a granular config per project and is more reliable (once the image is built, it is guaranteed to work). The system will consist of two parts:
I believe this PR is ready for review. Testing:
(force-pushed from db18450 to a35e46c)
For context, see #75703, pytorch/builder#1096. Note: depends on the docker image `pytorch/sync_s3_thirdparty_deps` from pytorch/builder#1096. Summary of additions:
* workflow config (based on the pytorch/sync_s3_thirdparty_deps GH action)
* S3 mapping config (sync_s3_cache.yml)
Pull Request resolved: #83306
Approved by: https://github.com/huydhn
Partially addresses: pytorch/pytorch#75703
This PR is currently a work in progress, see below for the list of TODOs.
Problem
As described in pytorch/pytorch#75703, there are currently multiple places where the build depends on third-party dependencies hosted on third-party servers (which have shown themselves to be unreliable).
Some of these places are just basic `wget` and `curl` requests to third-party URLs, which can (and should) be cached in our public S3 buckets to provide a fallback for reliability.

The issue with manually uploading such dependencies is two-fold: first, it's time-consuming; second, external engineers don't have access to the `ossci` S3 buckets. The issue with a fully automated upload is the lack of transparency, control, and security.
Proposal
This PR proposes a compromise between manual and fully automated caching of third-party dependencies.
The script (see `s3_management/thirdparty_deps/manage.py`) takes a `yml` configuration with a list of URLs and their corresponding S3 keys:

...and uploads the URLs (that haven't been synced yet) to S3, making them public.
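The core "skip what's already synced" logic could be sketched as below. This is a minimal, hypothetical sketch, not the code from `manage.py`: the function name `plan_sync` and the config shape (a flat URL-to-key mapping) are assumptions; the real script would additionally fetch each URL and upload it with the `ossci` credentials (e.g. via boto3) for every pair the planner returns.

```python
def plan_sync(config: dict[str, str], existing_keys: set[str]) -> list[tuple[str, str]]:
    """Return (url, s3_key) pairs that still need to be uploaded.

    config maps a source URL to its target S3 key. Keys already present
    in the bucket are skipped, so repeated runs are idempotent and only
    new entries in the yml config trigger network traffic.
    """
    return [(url, key) for url, key in config.items() if key not in existing_keys]


# Hypothetical usage: only the second dependency is missing from the bucket.
config = {
    "https://example.com/dep-1.0.tar.gz": "thirdparty/dep-1.0.tar.gz",
    "https://example.com/dep-2.0.tar.gz": "thirdparty/dep-2.0.tar.gz",
}
plan = plan_sync(config, existing_keys={"thirdparty/dep-1.0.tar.gz"})
```

Keeping the planning step pure (no I/O) makes it easy to unit-test the diffing behaviour without touching S3.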
The script is intended to be invoked by CI (which has access to the `ossci` S3 secrets).

Intended workflow
`master` branch, or manually

Validation
The script contains rudimentary automatic validation (currently hardcoded) of the yml config:
But the workflow mostly relies on manual review of the PRs containing the S3 cache update.
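Rudimentary config validation of this kind might look like the sketch below. The specific checks (https-only URLs, no leading slash in keys, no duplicate keys) are illustrative assumptions, not the exact hardcoded rules in `manage.py`.

```python
def validate_config(config: dict[str, str]) -> list[str]:
    """Return a list of human-readable errors; an empty list means the config passed.

    Illustrative checks only: require https URLs, reject keys with a
    leading slash, and reject duplicate S3 keys in the mapping.
    """
    errors = []
    seen_keys = set()
    for url, key in config.items():
        if not url.startswith("https://"):
            errors.append(f"non-https URL: {url}")
        if key.startswith("/"):
            errors.append(f"S3 key must not start with '/': {key}")
        if key in seen_keys:
            errors.append(f"duplicate S3 key: {key}")
        seen_keys.add(key)
    return errors
```

Collecting all errors (rather than failing on the first) gives PR reviewers a complete picture of what's wrong with a proposed cache update in one CI run.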
Advantages
`ossci` S3 buckets (at least the third-party cache part)

TODO
Potential issues/questions