# Deploy and Manage 100x models using Amazon SageMaker Pipelines
## Overview
This repository showcases a comprehensive end-to-end MLOps pipeline built with Amazon SageMaker Pipelines to deploy and manage 100x machine learning models. The pipeline covers data pre-processing, model training/re-training, hyperparameter tuning, data quality checks, model quality checks, the model registry, and model deployment. The MLOps pipeline is automated through Continuous Integration and Continuous Deployment (CI/CD). The machine learning model used in this sample code is the SageMaker built-in XGBoost algorithm.
## CDK Stacks
The AWS Cloud Development Kit (CDK) is used to define four stacks (a minimal wiring sketch follows the list):

1. **SM StudioSetup Stack**: Launches the SageMaker Studio domain and notebook environment with SageMaker projects enabled.
2. **SM Pipeline Stack**: Creates a CodePipeline with a CodeBuild project responsible for creating and updating the SageMaker pipeline that orchestrates MLOps. This stack defines the workflow and dependencies for building and deploying your machine learning pipeline.
3. **Start SM Pipeline Stack**: Responds to new training data uploaded to the specified S3 bucket. It uses a Lambda function to trigger the SageMaker pipeline, ensuring that your machine learning models are automatically updated with the latest data.
4. **Inference Result Stack**: Creates the resources, such as an Amazon SQS (Simple Queue Service) queue and Lambda functions, needed to handle inference results from the SageMaker endpoint.
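
As a rough illustration, the four stacks might be wired together in the CDK app entry point along the following lines. The module paths, class names, and stack IDs below are hypothetical, not necessarily the exact ones used in this repository:

```python
#!/usr/bin/env python3
# app.py -- minimal sketch of wiring the four stacks into one CDK app.
# Module paths, class names, and stack IDs are illustrative only.
import aws_cdk as cdk

from stacks.sm_studio_setup_stack import SMStudioSetupStack
from stacks.sm_pipeline_stack import SMPipelineStack
from stacks.start_sm_pipeline_stack import StartSMPipelineStack
from stacks.inference_result_stack import InferenceResultStack

app = cdk.App()

SMStudioSetupStack(app, "SMStudioSetupStack")        # Studio domain with SageMaker projects enabled
SMPipelineStack(app, "SMPipelineStack")              # CodePipeline/CodeBuild that builds the SageMaker pipeline
StartSMPipelineStack(app, "StartSMPipelineStack")    # S3-triggered Lambda that starts the pipeline
InferenceResultStack(app, "InferenceResultStack")    # SQS queue + Lambdas for inference results

app.synth()
```
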
## Dataset
We use a synthetic telecommunication customer churn dataset as our sample use case. The dataset contains customer phone plan usage, account information, and churn status, i.e., whether the customer stays with or leaves the plan. We use SageMaker's built-in XGBoost algorithm, which is well suited to this structured data. To extend the churn dataset, a new dimension column has been introduced that holds values DummyDim1 through DummyDimN. This allows 100x different models to be created and trained, with each model associated with a distinct DummyDimension. Each row in the dataset belongs to a specific DummyDimX and is used to train the corresponding model X. For instance, if there are X customer profiles, you can train X ML models, each associated with DummyDimension values ranging from DummyDim1 to DummyDimX. The inference dataset is the already pre-processed output of the data preprocessing step of the SageMaker pipeline.
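
To make the per-dimension fan-out concrete, the sketch below shows one way the dimension column could be used to split the dataset into one training set per model. The file and column names are assumptions for illustration, not the exact ones used in this repository:

```python
import pandas as pd

# Hypothetical illustration: split the churn dataset by its dimension column so that
# each DummyDimX subset becomes the training data for its own model X.
df = pd.read_csv("churn.csv")  # assumed file name

for dim, subset in df.groupby("DummyDimension"):  # DummyDim1 ... DummyDimN
    subset.drop(columns=["DummyDimension"]).to_csv(f"train_{dim}.csv", index=False)
```
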
## Architecture
The architecture relies primarily on Amazon SageMaker Pipelines to deliver an end-to-end MLOps pipeline for building, training, monitoring, and deploying machine learning models. The architecture can be divided into two main components:

1. **SageMaker Pipeline**: Handles data pre-processing, model training/tuning, monitoring, and deployment. It leverages the SageMaker Studio domain as a unified interface for model build and inference. AWS CodePipeline, through an AWS CodeBuild project, automates the creation and updating of the SageMaker pipeline. When new training data is uploaded to the input bucket, the SageMaker re-training pipeline is executed. The sequential steps include (a simplified pipeline-definition sketch follows the list):

- Pulling new training data from S3
- Preprocessing data for training
- Conducting data quality checks
- Training/tuning the model
- Performing model quality checks
- Utilizing the Model Registry to store the model
- Deploying the model
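
A highly simplified sketch of how the first of these steps can be expressed with the SageMaker Python SDK is shown below. Script names, bucket paths, framework versions, and step names are placeholders, and the quality-check, registry, and deployment steps are omitted for brevity:

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

session = sagemaker.Session()
role = sagemaker.get_execution_role()
region = session.boto_region_name

# Pre-process the raw training data pulled from S3 (script and bucket names are placeholders).
processor = SKLearnProcessor(
    framework_version="1.2-1",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)
step_preprocess = ProcessingStep(
    name="PreprocessChurnData",
    processor=processor,
    code="preprocess.py",
    inputs=[ProcessingInput(source="s3://<input-bucket>/raw/", destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(output_name="train", source="/opt/ml/processing/train")],
)

# Train the built-in XGBoost algorithm on the pre-processed data.
xgb = Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost", region, version="1.5-1"),
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)
xgb.set_hyperparameters(objective="binary:logistic", num_round=100)
step_train = TrainingStep(
    name="TrainChurnModel",
    estimator=xgb,
    inputs={
        "train": TrainingInput(
            s3_data=step_preprocess.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri,
            content_type="text/csv",
        )
    },
)

pipeline = Pipeline(name="model-train-deploy-pipeline", steps=[step_preprocess, step_train],
                    sagemaker_session=session)
pipeline.upsert(role_arn=role)  # create or update the pipeline definition
```
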
Model deployment is managed by a Lambda step, giving data scientists the flexibility to deploy specific models with customized logic. Because there are 100x model deployments, the deployment process is triggered immediately when a new model is added to the model registry, rather than waiting for manual approval. The LambdaStep is invoked to retrieve the most recently added model from the registry, deploy it to the SageMaker endpoint, and decommission the previous version of the model, ensuring a seamless transition and continuous availability of the latest model version for inference (as sketched below).
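
For illustration, a deployment step of this kind can be expressed with the SDK's `LambdaStep` roughly as follows; the Lambda function ARN, input keys, and naming convention are hypothetical, and the real deployment Lambda is created by the CDK stacks:

```python
from sagemaker.lambda_helper import Lambda
from sagemaker.workflow.lambda_step import LambdaOutput, LambdaOutputTypeEnum, LambdaStep

# Hypothetical function ARN for the deployment Lambda.
deploy_lambda = Lambda(
    function_arn="arn:aws:lambda:<region>:<account-id>:function:deploy-model-lambda"
)

step_deploy = LambdaStep(
    name="DeployLatestModel",
    lambda_func=deploy_lambda,
    inputs={
        # Assumed convention: one model package group (registry) per DummyDimension.
        "model_package_group_name": "churn-DummyDim1",
        "endpoint_name": "churn-inference-endpoint",
    },
    outputs=[LambdaOutput(output_name="deployment_status",
                          output_type=LambdaOutputTypeEnum.String)],
)
```
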
2. **Inference**: The real-time inference process is initiated by uploading a sample inference file to the Amazon S3 bucket. An AWS Lambda function is then triggered that fetches all records from the CSV file and dispatches them to an Amazon Simple Queue Service (SQS) queue. The SQS queue, in turn, activates a dedicated Lambda function, `consume_messages_lambda`, which invokes the SageMaker endpoint (a minimal handler sketch follows below). The endpoint runs the appropriate machine learning model on the provided data, and the resulting predictions are stored in an Amazon DynamoDB table for further analysis and retrieval. This end-to-end workflow provides efficient and scalable real-time inference by leveraging AWS services for seamless data processing and storage.
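
A minimal sketch of what the `consume_messages_lambda` handler could look like is shown below. The environment variable names, message layout, and table schema are assumptions for illustration, not necessarily those used in this repository:

```python
import json
import os

import boto3

# Assumed environment variable set by the Inference Result Stack.
sm_runtime = boto3.client("sagemaker-runtime")
results_table = boto3.resource("dynamodb").Table(os.environ["RESULTS_TABLE_NAME"])


def handler(event, context):
    """Consume SQS messages (one per CSV record) and store the model prediction."""
    for record in event["Records"]:
        message = json.loads(record["body"])

        # Invoke the endpoint with the already pre-processed CSV row.
        response = sm_runtime.invoke_endpoint(
            EndpointName=message["endpoint_name"],
            ContentType="text/csv",
            Body=message["features"],
        )
        prediction = response["Body"].read().decode("utf-8")

        # Persist the result for later analysis and retrieval.
        results_table.put_item(
            Item={
                "record_id": record["messageId"],
                "dimension": message.get("dimension", "unknown"),
                "prediction": prediction,
            }
        )
```
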
## How to set up the CDK project
This repository includes a `project_config.json` file containing the following attributes:
- **MainStackName**: Name of the main stack.
- **SageMakerPipelineName**: Name of the SageMaker pipeline.
- **SageMakerUserProfiles**: User profile names for the SageMaker Studio domain (e.g., `["User1", "User2"]`).
- **USE_AMT**: Automatic Model Tuning (AMT) flag. If set to `"yes"`, AMT will be employed for each model, and the best-performing model will be selected for deployment.

Please refer to this configuration file and update it as per your use case.
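
For orientation, an illustrative `project_config.json` with the attributes listed above might look like this (all values are placeholders to adapt to your use case):

```json
{
  "MainStackName": "sm-pipeline-main-stack",
  "SageMakerPipelineName": "model-train-deploy-pipeline",
  "SageMakerUserProfiles": ["User1", "User2"],
  "USE_AMT": "yes"
}
```
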
This project is set up like a standard Python project. The initialization
process also creates a virtualenv within this project, stored under the `.venv`
directory. To create the virtualenv it assumes that there is a `python3`
(or `python` for Windows) executable in your path with access to the `venv`
package. If for any reason the automatic creation of the virtualenv fails,
you can create the virtualenv manually.
To manually create a virtualenv on macOS and Linux:
```
$ python3 -m venv .venv
```
After the init process completes and the virtualenv is created, you can use the following
step to activate your virtualenv.
```
$ source .venv/bin/activate
```
If you are on a Windows platform, you would activate the virtualenv like this:
```
% .venv\Scripts\activate.bat
```
Once the virtualenv is activated, you can install the required dependencies.
```
$ pip install -r requirements.txt
```
At this point you can now synthesize the CloudFormation template for this code.
```
$ cdk synth --all
```
Deploy all stacks:
```
$ cdk deploy --all
```
After deploying all stacks, ensure the successful execution of the CodePipeline. In the AWS Console, navigate to `Developer Tools -> CodePipeline -> Pipelines -> model-train-deploy-pipeline-modelbuild` and verify that the pipeline completes successfully. If the Build phase fails, try rerunning it.

Move the directories from the `dataset/training-dataset` folder to the `inputbucket` S3 bucket. This will kick off the SageMaker pipeline, initiating three separate executions and deploying three models on the SageMaker endpoint. The process is expected to take approximately 45 minutes, and you can monitor the progress through the SageMaker Studio pipeline UI.

For each dimension in our dataset, a corresponding model registry will be created. In our current demonstration, where we have three dimensions, three model registries will be generated. Each model registry will encompass all the models associated with its respective dimension, ensuring a dedicated registry for each dimension.

After successfully executing all pipelines and deploying models on the SageMaker endpoint, copy the files from `dataset/inference-dataset` to the `inferencebucket` S3 bucket. Subsequently, the records are read, and the inference results are stored in a DynamoDB table. It's important to note that the inference data has already undergone preprocessing for seamless integration with the endpoint. In a production setting, it is recommended to implement an inference pipeline to preprocess input data consistently with the training data, ensuring alignment between training and production data.

## Useful commands
* `cdk ls` list all stacks in the app
* `cdk synth` emits the synthesized CloudFormation template
* `cdk deploy` deploy this stack to your default AWS account/region
* `cdk diff` compare deployed stack with current state
* `cdk docs` open CDK documentation
## Security
See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.
## License
This library is licensed under the MIT-0 License. See the LICENSE file.