diff --git a/create-use-case/prepare-dataset.mdx b/create-use-case/prepare-dataset.mdx
index c8e15f2..448b6c9 100644
--- a/create-use-case/prepare-dataset.mdx
+++ b/create-use-case/prepare-dataset.mdx
@@ -41,20 +41,19 @@ ingestor = CSVIngestor(

Make sure Docker is running on your system (e.g. by starting Docker Desktop), then execute the following command:

```bash
-# Build for cloud and push directly to registry
-docker buildx build --platform linux/amd64 -t <username>/<image>:<tag> --push .
+# Build for cloud (multi-arch) and push directly to registry
+docker buildx build --platform linux/amd64,linux/arm64 -t <username>/<image>:<tag> --push .
```

3. Edit ingestor-job.yaml:

- `metadata.name`: Unique job name (e.g. ingestor-job-train and ingestor-job-test)
- `image`: The tag you built and pushed
-- `LABEL_FILE`: Path inside container (e.g. /data/train.csv). Points to the CSV file with labels and/or data in case of tabular data
+- `LABEL_FILE`: Path inside the pod to the labels CSV, under the PVC mount (e.g. `/data/shared/labels.csv`). For tabular data, this is the same file that contains both labels and features.
- `TABLE_NAME`: Unique table name (no spaces, one per dataset). Title is optional
-- `PATH_TO_LOCAL_DATASET_FILE`: Path to your dataset file within the container
-- `SRC_PATH`: Root inside the container where your files are mounted
+- `SRC_PATH`: Root of the mounted dataset directory inside the pod (`/data/shared`, backed by `~/.tracebloc/<workspace>/data` on the client host)

4.
Deploy to Kubernetes

```bash
-`kubectl apply -f ingestor-job.yaml -n <namespace>`
+kubectl apply -f ingestor-job.yaml -n <namespace>
```

## Detailed Setup

@@ -85,6 +84,9 @@

Each template is already configured with the correct data category and format:

| Data Type | Template File | Data Category | Data Format |
|-----------|---------------|---------------|-------------|
| Tabular | templates/tabular_classification/tabular_classification.py | `TaskCategory.TABULAR_CLASSIFICATION` | `DataFormat.TABULAR` |
+| Tabular | templates/tabular_regression/tabular_regression.py | `TaskCategory.TABULAR_REGRESSION` | `DataFormat.TABULAR` |
+| Tabular | templates/time_series_forecasting/time_series_forecasting.py | `TaskCategory.TIME_SERIES_FORECASTING` | `DataFormat.TABULAR` |
+| Tabular | templates/time_to_event_prediction/time_to_event_prediction.py | `TaskCategory.TIME_TO_EVENT_PREDICTION` | `DataFormat.TABULAR` |
| Image | templates/image_classification/image_classification.py | `TaskCategory.IMAGE_CLASSIFICATION` | `DataFormat.IMAGE` |
| Image | templates/object_detection/object_detection.py | `TaskCategory.OBJECT_DETECTION` | `DataFormat.IMAGE` |
| Text | templates/text_classification/text_classification.py | `TaskCategory.TEXT_CLASSIFICATION` | `DataFormat.TEXT` |

@@ -180,7 +182,7 @@ object_detection_options = {

Define file extensions.

```python
-text_options = {"allowed_extension": FileExtension.TXT}  # Allowed text file extensions
+text_options = {"extension": FileExtension.TXT}  # Allowed text file extensions
```

#### Set CSV ingestion options

@@ -231,23 +233,46 @@ Other data types work similarly, follow the same configuration pattern using the

With your template configured, the next step is to package it into a Docker image so it can run inside the Kubernetes cluster.

-### Edit Dockerfile
+### Docker Hub Setup (first-time users)
+
+The cluster pulls your ingestor image from a public Docker registry, so you need an account before you can push.
If you already have one, skip to [Edit Dockerfile](#edit-dockerfile). + +1. **Create a Docker Hub account** at [hub.docker.com/signup](https://hub.docker.com/signup) and verify your email. +2. **Log in from your terminal** so the `docker push` command can authenticate: + + ```bash + docker login + ``` + +3. **Push the data ingestor image** to your account using the build/push commands in the next section. The image name takes the form `/:` — the username segment must match the account you just created. +4. **Make the image public** so the cluster can pull it without credentials: + - Go to [hub.docker.com/repositories](https://hub.docker.com/repositories), open the repository you just pushed. + - Click **Settings → Visibility settings → Make public**. + + Keeping the image private is also fine, but then you must create a Kubernetes `imagePullSecret` named `regcred` in the client namespace (the `ingestor-job.yaml` already references it). + +### Place data files on the client host -Before building the image, update your `Dockerfile` so that both the dataset and the ingestion script are copied into the container. This ensures the ingestor has everything it needs at runtime, independent of your local file system. +Datasets are **not** baked into the Docker image. They live on the client host in the per-workspace data directory and are mounted into the ingestor pod through the shared PVC (`client-pvc` → `/data/shared`). -#### Copy data files -For all use cases except tabular data (where labels and features are contained within a single labels.csv file), copy the data files into the Docker container: +Copy your dataset into the client's data directory, where `` is the workspace name you chose during client install (which is also the Helm release name and the Kubernetes namespace — the chart uses the same value for all three). 
The directory `~/.tracebloc//data/` is created automatically by the installer; just drop your files into it: ```bash -# Needed for image and text data: Copy source data into the container to /app -COPY LOCAL_PATH/images/ app/images/ -# Copy labels to /app -COPY LOCAL_PATH/labels.csv /app/labels.csv +# Host path on the machine where the tracebloc client is installed. +# HOST_DATA_DIR defaults to ~/.tracebloc; override only if you set it during install. +cp -R LOCAL_PATH/images ~/.tracebloc//data/ +cp LOCAL_PATH/labels.csv ~/.tracebloc//data/ ``` -Then, move the ingestion script over to the container as well: +Inside the ingestor pod this directory is mounted at `/data/shared`, so the same files appear as `/data/shared/images/...` and `/data/shared/labels.csv`. Set `SRC_PATH` and `LABEL_FILE` in `ingestor-job.yaml` to point at those in-pod paths (see [Configure Kubernetes](#3-configure-kubernetes) below). -```bash +For tabular data the same rule applies — drop the single `labels.csv` (with features and labels) into `~/.tracebloc//data/`. + +### Edit Dockerfile + +The Dockerfile only needs to package the ingestion script — the dataset is mounted at runtime, so do **not** `COPY` data into the image: + +```dockerfile # Copy the ingestion script into /app COPY templates/tabular_classification/tabular_classification.py /app/ingestor.py ``` @@ -255,7 +280,7 @@ COPY templates/tabular_classification/tabular_classification.py /app/ingestor.py ### Build Docker Image -You need a docker user and password to proceed with the next step. Most cloud platforms (AWS, Azure, GCP) run on Linux AMD64. Specifying `--platform linux/amd64` guarantees compatibility, particularly if you build images on Apple Silicon (M1/M2) or other ARM-based systems. Pick a setup, build and deploy the image: +You need a docker user and password to proceed with the next step. Cloud platforms run a mix of x86 and ARM nodes (e.g. AWS Graviton, Azure Ampere, GCP Tau T2A). 
Building a multi-arch image with `--platform linux/amd64,linux/arm64` guarantees the image runs on either, particularly if you build on Apple Silicon (M1/M2) or other ARM-based systems. Pick a setup, build and deploy the image: #### For Local Development/Testing @@ -270,11 +295,8 @@ docker push /: #### For Cloud Deployment (AWS, Azure, GCP) ```bash -# Build for Linux AMD64 (required for most cloud platforms) -docker build --platform linux/amd64 -t /: . - -# Build and push directly to registry -docker buildx build --platform linux/amd64 -t /: --push . +# Build a multi-arch image (works on x86 and ARM cloud nodes) and push directly to the registry +docker buildx build --platform linux/amd64,linux/arm64 -t /: --push . ``` @@ -287,7 +309,7 @@ apiVersion: batch/v1 kind: Job metadata: name: # Set a job name e.g. ingestor-job-train - namespace: # Use the client namespace + namespace: # Use the client namespace spec: template: spec: @@ -297,7 +319,7 @@ spec: imagePullPolicy: Always # Use IfNotPresent only for local tests volumeMounts: - name: shared-volume - mountPath: "/data/shared" # Client shared storage. Target for copied files, not the local source path + mountPath: "/data/shared" # Client shared PVC. Backed by ~/.tracebloc//data on the client host — read your dataset from here env: # Client credentials - name: CLIENT_ENV @@ -315,25 +337,23 @@ spec: - name: MYSQL_HOST # value has to match the mysql deployment name in the client values.yaml value: "mysql-client" - # Dataset information + # Dataset information — paths inside the ingestor pod. + # /data/shared is the mount of the client-pvc, which is backed by + # ~/.tracebloc//data on the client host. 
- name: SRC_PATH - value: "/app" # Source folder path within the data ingestor + value: "/data/shared" # Root of the mounted dataset directory - name: LABEL_FILE - value: # Example: "/app/labels.csv" - - name: COMPANY - value: + value: "/data/shared/labels.csv" # Path to the labels CSV inside the pod - name: TABLE_NAME value: # Different for train and test, no spaces - name: TITLE value: # Optional - name: BATCH_SIZE - value: "4000" # Number of entries per request. Depends on CPU memory, not data size. 5,000 is a safe default, tested up to 10,000. + value: "4000" # Optional, defaults to 4000 - name: LOG_LEVEL value: "DEBUG" # Set DEBUG, "WARNING", "INFO" or "ERROR" imagePullSecrets: - name: regcred - nodeSelector: - type: system volumes: - name: shared-volume persistentVolumeClaim: @@ -347,11 +367,9 @@ spec: - `image`, your Docker image (imagePullPolicy: Always for DockerHub, IfNotPresent for local) - `CLIENT_ID`, `CLIENT_PASSWORD` from the [tracebloc client view](https://ai.tracebloc.io/clients) - `TABLE_NAME`, unique per dataset, train and test use different names, no spaces. Different names for train and test data is mandatory -- `LABEL_FILE`, path inside the ingestor container, for images this is usually a CSV with file path and label columns. Ensure it matches the copy path in the `Dockerfile` -- `PATH_TO_LOCAL_DATASET_FILE`, path to your dataset file within the container -- `SRC_PATH`, root inside the container where your files are mounted -- `YOUR_COMPANY_OR_ORGANISATION_NAME`, chose a suitable company or organisation name -- `BATCH_SIZE`, number of entries sent per request. Depends on available CPU memory, not data size (e.g. image dimensions). Too large can exhaust memory. Tested up to 10,000, but 5,000 is a safe default for most systems. 
+- `LABEL_FILE`, path inside the ingestor pod (under `/data/shared`) to the CSV with file paths and labels — must match the location of the file you placed in `~/.tracebloc//data/` +- `SRC_PATH`, root inside the pod where the dataset directory is mounted (`/data/shared`) +- `BATCH_SIZE` is the number of entries sent to the server per request. Optional — defaults to 4000. Keep it consistent across data types. It depends on available CPU memory, not for example image size. Too large can exhaust memory. It was tested up to 10,000, but 5,000 is a safe default for most systems. - `LOG_LEVEL`, "WARNING" for all warnings and errors, "INFO" for all logs, "ERROR" for errors only ### 4. Deploy @@ -359,12 +377,12 @@ spec: Run the ingestor as a Kubernetes Job: ```bash -kubectl apply -f ingestor-job.yaml -n -kubectl wait -n --for=condition=complete job/ -kubectl logs -n job/ +kubectl apply -f ingestor-job.yaml -n +kubectl wait -n --for=condition=complete job/ +kubectl logs -n job/ # Delete the job only after verifying logs -kubectl delete -n job/ +kubectl delete -n job/ ``` This will start a pod, run the ingestion process once, and once complete you can delete the job. @@ -378,8 +396,8 @@ The data ingestor always runs a validation step before ingestion and moving file Verify if jobs and pods are deployed successfully and running: ```bash -kubectl get jobs,pods -n -kubectl logs -n +kubectl get jobs,pods -n +kubectl logs -n ``` Look for "All records processed successfully" in the logs. @@ -402,17 +420,17 @@ View your datasets at [ai.tracebloc.io/data](https://ai.tracebloc.io/data) after ## Troubleshooting -**Recommended for debugging:** Use [k9s](https://k9scli.io/), a terminal-based Kubernetes dashboard, to monitor jobs, pods, and logs in real time. Run `k9s -n ` to get a live view of resources, switch between them instantly, and inspect logs or events with a few keystrokes. Compared to kubectl, it is faster and more convenient. 
+**Recommended for debugging:** Use [k9s](https://k9scli.io/), a terminal-based Kubernetes dashboard, to monitor jobs, pods, and logs in real time. Run `k9s -n ` to get a live view of resources, switch between them instantly, and inspect logs or events with a few keystrokes. Compared to kubectl, it is faster and more convenient. **Stale Kubernetes Job preventing new Job execution:** ```bash -kubectl delete job ingestor-job -n +kubectl delete job ingestor-job -n kubectl logs ``` **Storage Issues:** ```bash -kubectl get pvc -n +kubectl get pvc -n ``` --- diff --git a/create-use-case/prerequisites.mdx b/create-use-case/prerequisites.mdx index b75129e..dde87bc 100644 --- a/create-use-case/prerequisites.mdx +++ b/create-use-case/prerequisites.mdx @@ -25,16 +25,20 @@ Once these requirements are met, proceed with: **Requirements for all image data tasks:** Uniform image sizes and uniform file types. For example all images as 256x256 rgb .jpg files. Convert files if necessary and in case your images do not fit the supported size, crop or resize accordingly. + +**Filenames in the label CSV:** For all image and text tasks, the `filename` column in the label CSV **must not include the file extension** (e.g. use `cat01`, not `cat01.jpg`). The extension is configured once on the ingestor side via `file_options.extension` in the template and applied to every row at ingestion time. + + All images are validated before ingestion by the data ingestor. The ingestion process only starts when every file meets the requirements. Fix or remove any invalid images, then retry. For object detection, images and annotations are automatically up- or downsized to 448x448 pixels. 
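The two rules that most often fail this validation, an extension left in the `filename` column and mixed file types in the data directory, can be pre-checked locally before running the ingestor. A minimal sketch with a hypothetical helper (not part of the ingestor; it assumes a `labels.csv` with a `filename` column next to an image directory, as in the structure examples in this guide):

```python
import csv
import os

def validate_dataset(labels_csv, images_dir):
    """Return a list of problems found; empty means the checks passed."""
    problems = []
    # Rule 1: the filename column must not carry a file extension.
    with open(labels_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            name = row["filename"]
            if os.path.splitext(name)[1]:
                problems.append(f"{name}: filename column must not include an extension")
    # Rule 2: all data files must share a single file type.
    extensions = {os.path.splitext(p)[1].lower() for p in os.listdir(images_dir)}
    if len(extensions) > 1:
        problems.append(f"mixed file types found: {sorted(extensions)}")
    return problems
```

Uniform image dimensions and color mode are not covered by this sketch; checking those requires an image library such as Pillow.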
| Task | Input file type | Color mode | Supported image size | Label file type | Requirements | Links | |-----------------------|-----------------|--------------------------------------------------------------|----------------|-----------------------------------|----------------------------------------------|----------| -| Classification | PNG, JPG, JPEG | rgb (3 channels) or grayscale (1 channel), 8-bit per channel | Square (height = width) | CSV | Uniform image size and file type per dataset | [Detailed structure](#image-classification)

[Example](https://github.com/tracebloc/data-ingestors/tree/develop/templates) | -| Keypoint Detection | PNG, JPG, JPEG | rgb (3 channels) or grayscale (1 channel), 8-bit per channel | Square (height = width) | CSV | Uniform image sizes

Same number of keypoints per image and class | [Detailed structure](#image-keypoint-detection)

[Example](https://github.com/tracebloc/data-ingestors/tree/develop/templates) | -| Object Detection | PNG, JPG, JPEG | rgb (3 channels) or grayscale (1 channel), 8-bit per channel | Square (height = width). Images and annotations will be resized to 448x448 px automatically | Pascal VOC | Uniform image sizes, one xml per image | [Detailed structure](#image-object-detection)

[Example](https://github.com/tracebloc/data-ingestors/tree/develop/templates) | -| Semantic Segmentation | PNG, JPG, JPEG | rgb (3 channels) or grayscale (1 channel), 8-bit per channel | Square (height = width) | PNG, JPG, JPEG | Uniform image and mask sizes | [Detailed structure](#image-semantic-segmentation)

[Example](https://github.com/tracebloc/data-ingestors/tree/develop/templates) | +| Classification | PNG, JPG, JPEG | rgb (3 channels) or grayscale (1 channel), 8-bit per channel | Square (height = width) | CSV | Uniform image size and file type per dataset | [Detailed structure](#image-classification)

[Example](https://github.com/tracebloc/data-ingestors/tree/develop/templates/image_classification) | +| Keypoint Detection | PNG, JPG, JPEG | rgb (3 channels) or grayscale (1 channel), 8-bit per channel | Square (height = width) | CSV | Uniform image sizes

Same number of keypoints per image and class | [Detailed structure](#image-keypoint-detection)

Template coming soon — contact [support@tracebloc.io](mailto:support@tracebloc.io) | +| Object Detection | PNG, JPG, JPEG | rgb (3 channels) or grayscale (1 channel), 8-bit per channel | Square (height = width). Images and annotations will be resized to 448x448 px automatically | Pascal VOC | Uniform image sizes, one xml per image | [Detailed structure](#image-object-detection)

[Example](https://github.com/tracebloc/data-ingestors/tree/develop/templates/object_detection) | +| Semantic Segmentation | PNG, JPG, JPEG | rgb (3 channels) or grayscale (1 channel), 8-bit per channel | Square (height = width) | PNG, JPG, JPEG | Uniform image and mask sizes | [Detailed structure](#image-semantic-segmentation)

Template coming soon — contact [support@tracebloc.io](mailto:support@tracebloc.io) | ### Image Classification @@ -56,7 +60,7 @@ dog02,dog ... ``` -The filename should not include the file extension. +The `filename` column must not include the file extension. Set the expected extension once via `file_options.extension` in the ingestor template. ### Image Keypoint Detection @@ -125,7 +129,7 @@ test/ Each row represents one detected object, not one image. An image with multiple objects will have multiple rows. ``` labels.csv -filename,label +filename,image_label street01,car street01,car street01,person @@ -133,6 +137,8 @@ street02,car ... ``` +The `filename` column links each row to its image and to the matching XML annotation file. The `image_label` column holds the class name for each object instance — one row per object. Filenames should not include the file extension. + ### Image Semantic Segmentation Each mask is an rgb image whose pixel values map to classes defined in labels.csv. The labels.csv contains a global list per image and class. All masks must exactly match their corresponding image sizes and file names. @@ -170,9 +176,12 @@ image2,mask2,road,#FFFFFF **Requirements for all tabular data tasks:** Each dataset must be provided as a single CSV file with a header row. Every column must contain uniform data types, for example numeric values for features and a categorical or alphanumeric column for labels. Use UTF-8 encoding with comma separators and validate that your schema matches the expected types. Invalid rows are skipped by the ingestor. -| Task | Data file type | Requirements | Links | -|----------------|--------------------|---------------------------------------------------------------------------------------------------------|----------| -| Classification | CSV (features and label in one single file) | Uniform data formats per column.

Feature columns: Numeric

Label columns: Alphanumeric | [Detailed structure](#tabular-classification)

[Example](https://github.com/tracebloc/data-ingestors/tree/develop/templates) | +| Task | Data file type | Requirements | Links | +|--------------------------|---------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|----------| +| Classification | CSV (features and label in one single file) | Uniform data formats per column.

Feature columns: Numeric

Label column: Alphanumeric | [Detailed structure](#tabular-classification)

[Example](https://github.com/tracebloc/data-ingestors/tree/develop/templates/tabular_classification) | +| Regression | CSV (features and label in one single file) | Uniform data formats per column.

Feature columns: Numeric

Label column: Numeric (continuous target) | [Detailed structure](#tabular-regression)

[Example](https://github.com/tracebloc/data-ingestors/tree/develop/templates/tabular_regression) | +| Time Series Forecasting | CSV (timestamp, features and target in one single file) | A timestamp column in a parsable format (e.g. `YYYY-MM-DD` or ISO 8601).

Feature columns: Numeric

Target column: Numeric | [Detailed structure](#time-series-forecasting)

[Example](https://github.com/tracebloc/data-ingestors/tree/develop/templates/time_series_forecasting) | +| Time to Event Prediction | CSV (features, time and event in one single file) | A `time` column (duration until event or censoring, integer or numeric).

An event column (binary 0/1 indicating whether the event occurred).

Feature columns: Numeric | [Detailed structure](#time-to-event-prediction)

[Example](https://github.com/tracebloc/data-ingestors/tree/develop/templates/time_to_event_prediction) | ### Tabular Classification @@ -186,19 +195,51 @@ id,feature1,feature2,feature3,label ... ``` -### Regression +### Tabular Regression + +Same structure as Tabular Classification, but the label column holds a continuous numeric target (not a class). + +``` csv +id,square_feet,bedrooms,age,price +1,1668.08,3,15,285.50 +2,1701.78,4,12,320.75 +3,1697.01,2,8,245.30 +... +``` + +### Time Series Forecasting + +Provide a single CSV with a timestamp column, one or more numeric feature columns, and the numeric target column you want to forecast. Rows must be ordered by time and use a consistent timestamp format. -Regression tasks are supported on the platform. Evaluation uses Mean Absolute Error (MAE). For details on supported metrics, see [Supported Metrics](/create-use-case/define#supported-metrics-per-data-type-and-task). Reach out to us at [support@tracebloc.io](mailto:support@tracebloc.io) for guidance on data format requirements for regression use cases. +``` csv +timestamp,feature_1,feature_2,target +2023-10-01,7,1,125.50 +2023-10-02,1,0,132.30 +2023-10-03,2,0,128.75 +... +``` + +### Time to Event Prediction + +Provide a single CSV with feature columns, a `time` column (duration to event or censoring), and a binary event column (1 = event occurred, 0 = censored). + +``` csv +age,feature_1,feature_2,time,event +75,0,1.9,4,1 +55,0,1.1,6,1 +65,0,1.3,7,0 +... +``` ### Text Data | Task | Input files | Label file type | Requirements | Links | |-----------------------|----------------|-----------------------------------|---------------------------------------------------------------------------------------------------------|----------| -| Classification | TXT | CSV | Text file may not be empty | [Detailed structure](#text-classification)

[Example](https://github.com/tracebloc/data-ingestors/tree/develop/templates) | +| Classification | TXT | CSV | Text file may not be empty | [Detailed structure](#text-classification)

[Example](https://github.com/tracebloc/data-ingestors/tree/develop/templates/text_classification) | ### Text Classification -The filename should not include the file extension. +The `filename` column must not include the file extension. The extension is set once via `file_options.extension` in the ingestor template (e.g. `FileExtension.TXT`). ``` structure train/ diff --git a/environment-setup/configuration.mdx b/environment-setup/configuration.mdx index 7ccee59..7154e4d 100644 --- a/environment-setup/configuration.mdx +++ b/environment-setup/configuration.mdx @@ -24,7 +24,7 @@ Override defaults by setting environment variables before the install command. U Example — custom cluster name with two worker nodes: ```bash -CLUSTER_NAME=my-cluster AGENTS=2 bash <(curl -fsSL https://tracebloc.io/install.sh) +CLUSTER_NAME=my-cluster AGENTS=2 bash <(curl -fsSL https://tracebloc.io/i.sh) ``` ## Cluster Management @@ -96,10 +96,12 @@ The installer does **not** install GPU drivers on Windows. Pre-install NVIDIA dr Skip the installer entirely. Use this if you already have a Kubernetes cluster, need custom resource limits, or want full control over the Helm deployment. +A single unified chart — **`tracebloc/client`** — supports AKS, EKS, bare-metal, and OpenShift. Platform behaviour is selected via values overrides; reference defaults live in the repo at [`client/ci/{aks,eks,bm,oc}-values.yaml`](https://github.com/tracebloc/client/tree/main/client/ci). 
+ ### Add the Helm repository ```bash -helm repo add tracebloc https://tracebloc.github.io/client/ +helm repo add tracebloc https://tracebloc.github.io/client helm repo update ``` @@ -115,52 +117,146 @@ helm show values tracebloc/client > values.yaml #### Authentication -Connect the client to your tracebloc account: +Set your Client ID and password from the [tracebloc client view](https://ai.tracebloc.io/clients): ```yaml clientId: "" clientPassword: "" ``` -#### Resource Limits +#### Resource Limits for Training Jobs -Control how much CPU, memory, and GPU each training job can consume. Size these according to your workloads and available hardware: +Defaults are sized for typical workloads. Override per job size; for GPU support, requests and limits **must** be equal: ```yaml env: RESOURCE_REQUESTS: "cpu=2,memory=8Gi" RESOURCE_LIMITS: "cpu=2,memory=8Gi" - GPU_REQUESTS: "" # "nvidia.com/gpu=1" for GPU - GPU_LIMITS: "" # "nvidia.com/gpu=1" for GPU - RUNTIME_CLASS_NAME: "" # "nvidia" for GPU with k3s + GPU_REQUESTS: "" # "nvidia.com/gpu=1" for GPU + GPU_LIMITS: "" # "nvidia.com/gpu=1" for GPU + RUNTIME_CLASS_NAME: "" # "nvidia" for k3s GPU ``` #### Storage -Persistent volumes for the database, logs, and training data. Adjust sizes based on your dataset: +Storage class and PVC sizes: ```yaml storageClass: create: true - name: client-storage-class - provisioner: manual + provisioner: "" # set per platform (see ci/*-values.yaml) allowVolumeExpansion: true parameters: {} +# Bare-metal only — hostPath-backed PVs at /tracebloc/{data,logs,mysql} hostPath: - enabled: true + enabled: false pvc: mysql: 2Gi logs: 10Gi data: 50Gi +``` + +Platform snippets (drop into your values file): + +
+AKS + +```yaml +storageClass: + create: true + provisioner: file.csi.azure.com + parameters: + skuName: Standard_LRS + mountOptions: + - dir_mode=0750 + - file_mode=0640 + - uid=999 + - gid=999 + - mfsymlinks + - cache=strict + - actimeo=30 +clusterScope: true +``` +
+ +
+EKS +```yaml +storageClass: + create: true + provisioner: efs.csi.aws.com + volumeBindingMode: Immediate + reclaimPolicy: Retain + mountOptions: [actimeo=30] + parameters: + directoryPerms: "700" + uid: "999" + gid: "999" + fileSystemId: + provisioningMode: efs-ap +clusterScope: true +``` +
+ +
+Bare-metal / k3s / k3d + +```yaml +hostPath: + enabled: true pvcAccessMode: ReadWriteOnce +storageClass: + create: true + provisioner: kubernetes.io/no-provisioner +namespace: + podSecurity: + enforce: "" # hostPath needs the privileged init-mysql-data container + enforceVersion: "" +clusterScope: true ``` +
-#### Proxy (optional) +
+OpenShift -Only needed if your machine accesses the internet through a corporate proxy: +```yaml +storageClass: + create: false + name: ocs-storagecluster-cephfs +clusterScope: false +openshift: + scc: + enabled: true +networkPolicy: + training: + enabled: true + dnsNamespace: openshift-dns + dnsSelector: + dns.operator.openshift.io/daemonset-dns: default + clusterCidrs: + - "10.128.0.0/14" + - "172.30.0.0/16" +``` +
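Since platform behaviour lives entirely in values overrides, a typical install layers one of the snippets above on top of your base values file. A hedged sketch (release name, namespace, and the `eks-values.yaml` file name are assumptions; Helm gives later `--values` files precedence on conflicting keys):

```bash
# Base values first, platform override last
helm upgrade --install client tracebloc/client \
  --namespace tracebloc --create-namespace \
  --values values.yaml \
  --values eks-values.yaml
```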
+ +#### Docker Registry + +The chart pulls the client image from a container registry — credentials are required in production. Use a token, not a plaintext password. + +```yaml +dockerRegistry: + server: https://index.docker.io/v1/ + username: + password: + email: +``` + +The chart auto-creates a secret named `{{ .Release.Name }}-regcred`. Omit the `dockerRegistry` block entirely to skip pull-secret creation (e.g. when using a public mirror). + +#### Proxy (optional) ```yaml env: @@ -170,6 +266,129 @@ env: HTTP_PROXY_PASSWORD: "" ``` +#### Auto-upgrade (on by default) + +Releases of chart `1.3.0+` install a `-auto-upgrade` CronJob that polls `https://tracebloc.github.io/client` daily and runs `helm upgrade --reset-then-reuse-values` whenever a newer chart version is published. Closes [tracebloc/client#69](https://github.com/tracebloc/client/issues/69) — older deployed clients no longer drift from the latest secure release. + +```yaml +autoUpgrade: + enabled: true # set false to opt out + schedule: "23 2 * * *" # daily at 02:23 UTC + suspend: false # one-shot pause without removing resources + repoUrl: "https://tracebloc.github.io/client" + repoName: "tracebloc" + chartName: "client" + timeout: "10m" +``` + +The CronJob's ServiceAccount is bound to the built-in `cluster-admin` ClusterRole because the chart templates cluster-scoped resources (PriorityClass, StorageClass, ClusterRoleBinding, optionally Namespace). Disable if you need a manual approval gate on upgrades. + +#### NetworkPolicy hardening for training pods + +Training pods run untrusted ML code. The chart can apply a NetworkPolicy that denies ingress and restricts egress to DNS + external HTTPS only — blocking pod-to-pod, MySQL, and Kubernetes API access from the training pod. 
+ +```yaml +networkPolicy: + training: + enabled: true + dnsNamespace: kube-system + dnsSelector: {} # empty falls back to {k8s-app: kube-dns} + clusterCidrs: + - "10.0.0.0/8" + - "172.16.0.0/12" + - "192.168.0.0/16" +``` + +Requires a CNI that **enforces** NetworkPolicy: + +| Platform | Notes | +|----------|-------| +| AKS | needs `--network-policy azure` (Azure NPM) or Calico at cluster create | +| EKS | needs Calico or Cilium add-on (the default AWS VPC CNI alone does **not** enforce) | +| Bare-metal | needs Calico / Cilium / kube-router (Flannel alone does **not** enforce) | +| OpenShift | OVN-Kubernetes enforces by default | + +Leave `enabled: false` on clusters without an enforcing CNI — silently having no protection is worse than explicitly disabling it. + + +The chart's training-pod egress lockdown only blocks traffic if your CNI enforces NetworkPolicy. Verify your CNI before relying on it. + + +#### Resource Monitor and node-agents namespace + +The `tracebloc-resource-monitor` DaemonSet collects node-level CPU/memory metrics. It mounts `hostPath` volumes (`/proc`, `/sys`) which Pod Security Admission's `restricted` profile bans — so the chart isolates it in a dedicated **privileged** namespace (default `tracebloc-node-agents`). + +```yaml +resourceMonitor: true # set false on clusters where metrics-server cannot be installed +nodeAgents: + namespace: + create: true + name: tracebloc-node-agents +``` + +When `create: false`, create the namespace yourself with the required PSA labels: + +```bash +kubectl create namespace tracebloc-node-agents +kubectl label namespace tracebloc-node-agents \ + pod-security.kubernetes.io/enforce=privileged \ + pod-security.kubernetes.io/warn=privileged \ + pod-security.kubernetes.io/audit=privileged +``` + +The DaemonSet **requires** `metrics-server`. 
It is bundled on k3d/k3s/AKS, present on OpenShift, and **must be installed manually on EKS** (`kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml`). + +#### Pod Security Admission labels + +Training Jobs run untrusted user-supplied ML code. The chart can create the release namespace with Pod Security Admission `warn`/`audit`/`enforce` labels at the `restricted` profile for defense-in-depth: + +```yaml +namespace: + create: false # true only on greenfield installs + podSecurity: + warn: restricted + audit: restricted + enforce: restricted # set "" for bare-metal hostPath installs +``` + +When `create: false` (default) and you want PSA labels on an existing namespace: + +```bash +kubectl label namespace \ + pod-security.kubernetes.io/warn=restricted \ + pod-security.kubernetes.io/audit=restricted \ + pod-security.kubernetes.io/enforce=restricted +``` + +#### Image digest pinning + +Pin images by content hash for reproducible deploys. When `digest` is set, `tag` is ignored and `imagePullPolicy` drops to `IfNotPresent`. + +```yaml +images: + jobsManager: { digest: "sha256:..." } + podsMonitor: { digest: "sha256:..." } + resourceMonitor: { digest: "sha256:..." } + requestsProxy: { digest: "sha256:..." } + mysqlClient: { tag: "", digest: "" } + busybox: { tag: "1.35", digest: "" } +``` + +#### PriorityClass and PodDisruptionBudgets + +The chart pins the MySQL pod with a `tracebloc-data-plane` PriorityClass (value `1000000`) so it survives node-level OOM and scheduling pressure, and applies PDBs to MySQL and the jobs manager. 
Override only if you run a multi-replica MySQL externally:
+
+```yaml
+priorityClass:
+  create: true
+  name: tracebloc-data-plane
+  value: 1000000
+
+podDisruptionBudget:
+  mysql: { create: true }
+  jobsManager: { create: true }
+```
+
@@ -181,27 +400,46 @@
 ### Deploy
 
 Install the chart into a new namespace:
 
 ```bash
 helm upgrade --install <release-name> tracebloc/client \
   --namespace <namespace> \
   --values values.yaml
 ```
 
-### Update
+### Upgrade
 
-Pull the latest chart version and apply your configuration:
+The auto-upgrade CronJob handles routine version bumps. To upgrade manually:
 
 ```bash
 helm repo update
 helm upgrade <release-name> tracebloc/client \
   --namespace <namespace> \
+  --reset-then-reuse-values \
   --values values.yaml
 ```
 
-### Uninstall
+
+When upgrading **into** chart 1.3.0 from 1.2.x, use `--reset-then-reuse-values` (not plain `--reuse-values`) — the new `autoUpgrade` block did not exist in 1.2.x and a plain reuse fails template rendering.
+
 
-Remove the client and all associated resources:
+### Uninstall
 
 ```bash
 helm uninstall <release-name> -n <namespace>
+```
+
+PVCs and the PriorityClass are annotated `helm.sh/resource-policy: keep` so your data and shared cluster resources survive uninstall. To remove them too:
+
+```bash
 kubectl delete pvc --all -n <namespace>
 kubectl delete namespace <namespace>
 ```
 
+### Migrating from legacy charts
+
+If you installed before chart 1.3.x using `tracebloc/aks`, `tracebloc/eks`, or `tracebloc/bm`, see the [migration guide in the client repo](https://github.com/tracebloc/client/blob/main/client/MIGRATION.md).
Key changes:
+
+- 4 charts → 1 chart (`tracebloc/client`) with platform values overrides
+- Auth keys flattened: `jobsManager.env.CLIENT_ID` + `secrets.clientPassword` → top-level `clientId` + `clientPassword`
+- PVC keys flattened: `clientData` / `clientLogsPvc` / `mysqlPvc` (with `name`, `storage`, `hostPath`) → `pvc.{data,logs,mysql}` (size only) + `hostPath.enabled` for bare-metal
+- ServiceAccount renamed from `default` to `{{ .Release.Name }}-jobs-manager`
+- Pull-secret renamed from hard-coded `regcred` to `{{ .Release.Name }}-regcred`
+- The `namespace` value in the legacy `values.yaml` is gone — use `helm install -n <namespace>` instead
+
 ## Security
 
 Tracebloc is designed so your data never has to leave your network. Here's how:
 
diff --git a/environment-setup/eks-client-deployment-guide.mdx b/environment-setup/eks-client-deployment-guide.mdx
index 835062b..abc9be3 100644
--- a/environment-setup/eks-client-deployment-guide.mdx
+++ b/environment-setup/eks-client-deployment-guide.mdx
@@ -185,7 +185,7 @@ This section walks through a step-by-step build with AWS CLI and kubectl. It mir
 5. **Client Configuration** — Install the Amazon EFS (Elastic File System) CSI driver (Container Storage Interface) in your EKS (Elastic Kubernetes Service) cluster. This driver is what lets Kubernetes automatically create and mount EFS storage volumes.
 6. **Client Deployment** — Add the tracebloc Helm repository, configure your deployment values (authentication credentials, registry access, storage settings, resource limits), install the chart into your chosen namespace. Deploy and verify that all pods are running and persistent volume claims are properly bound.
 
-**Helm Usage**: Helm is used to install and manage Kubernetes applications. In steps 5 and 6 you will deploy the tracebloc client via the tracebloc/eks chart from the tracebloc Helm repository.
+**Helm Usage**: Helm is used to install and manage Kubernetes applications.
In steps 5 and 6 you will deploy the tracebloc client via the unified `tracebloc/client` chart with EKS-specific values. ## 1. VPC and Network Configuration @@ -631,156 +631,138 @@ A values.yaml file controls this deployment, where you specify credentials (to a ### Add Helm Repository -Install and update the tracebloc client using Helm instead of managing raw YAML. For details, refer to the public [GitHub Repository](https://github.com/tracebloc/client/tree/main/eks). +The tracebloc client is delivered as a single unified Helm chart (`tracebloc/client`) that supports AKS, EKS, bare-metal, and OpenShift. Source: [tracebloc/client](https://github.com/tracebloc/client). ```bash -helm repo add tracebloc https://tracebloc.github.io/client/ +helm repo add tracebloc https://tracebloc.github.io/client helm repo update ``` -Adds the official tracebloc Helm repository to your local configuration so you can install the client with a single Helm command. - ### Configure your Deployment Settings Export the chart's default configuration into a local file that you can edit: ```bash -helm show values tracebloc/eks > values.yaml +helm show values tracebloc/client > values.yaml ``` -Downloads the default configuration template for the tracebloc client. Open and update the following sections in `values.yaml`: - -#### Deployment Namespace - -Use the defined namespace: - -```yaml -namespace: -``` -Defines where the client will be deployed. +Open and update the following sections in `values.yaml`: #### Tracebloc Authentication -##### Client ID - -Provide your client ID from the [tracebloc client view](https://ai.tracebloc.io/clients). 
-
-```yaml
-jobsManager:
-  env:
-    CLIENT_ID: ""
-```
-##### Client Password
-
-Set `create: true` to generate the secret during installation:
+Provide your Client ID and password from the [tracebloc client view](https://ai.tracebloc.io/clients):
 
 ```yaml
-# Secrets configuration
-secrets:
-  # Whether to create the secret or use existing secret
-  create: true
-  # Client password
-  clientPassword: ""
+clientId: ""
+clientPassword: ""
 ```
 
 #### Docker Registry Configuration
 
-The tracebloc client images are stored in a private container registry. Kubernetes needs valid Docker Hub credentials to pull these images onto your nodes.
+The tracebloc client images live on Docker Hub. Provide credentials so Kubernetes can pull them — the chart auto-creates a secret named `{{ .Release.Name }}-regcred`:
 
 ```yaml
 dockerRegistry:
-  create: true
-  secretName: regcred
   server: https://index.docker.io/v1/
   username: <DOCKER_USERNAME>
-  password: <DOCKER_PASSWORD> OR <DOCKER_TOKEN>
+  password: <DOCKER_PASSWORD or DOCKER_TOKEN>
   email: <DOCKER_EMAIL>
 ```
 
-- DOCKER_USERNAME: Docker Hub username
-- DOCKER_PASSWORD: Password or access token (if 2FA enabled)
-- DOCKER_TOKEN: Alternative token for automation or personal access
-- DOCKER_EMAIL: Email linked to your Docker account
+- `DOCKER_USERNAME`: Docker Hub username
+- `DOCKER_PASSWORD` / `DOCKER_TOKEN`: Password, or an access token (preferred — required if 2FA is enabled)
+- `DOCKER_EMAIL`: Email linked to your Docker account
 
-#### Storage
+#### Storage (EKS / EFS)
 
-Your workloads need persistent storage that survives pod restarts and can be shared across all nodes. Kubernetes uses PersistentVolumeClaims (PVCs) to request storage, and in this setup those PVCs are backed by EFS. By linking PVCs to your EFS file system, training pods can read datasets, write logs, and store database files even if they move between nodes. Configure Persistent Volume Claims (PVCs) for datasets, logs, and MySQL:
+The chart provisions an EFS-backed storage class via the EFS CSI driver.
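If the file system ID isn't at hand, the AWS CLI can list the file systems in your account; the `is_fs_id` helper below is an illustrative sanity check for the usual `fs-` + hex format, not part of the chart:

```shell
# List EFS file system IDs in the current region (requires AWS credentials):
#   aws efs describe-file-systems --query 'FileSystems[].FileSystemId' --output text

# Illustrative check that a value looks like an EFS file system ID.
is_fs_id() {
  printf '%s\n' "$1" | grep -Eq '^fs-[0-9a-f]{8,17}$'
}

is_fs_id "fs-0123456789abcdef0" && echo "format looks valid"
```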
Use **EFS Access Points** (`provisioningMode: efs-ap`) for proper UID/GID enforcement on shared filesystems:
 
 ```yaml
 storageClass:
-  # Set to true to create a new storage class, false to use existing. Be careful not to overwrite existing datafiles by setting it true.
   create: true
-...
+  provisioner: efs.csi.aws.com
+  volumeBindingMode: Immediate
+  reclaimPolicy: Retain
+  mountOptions:
+    - actimeo=30
   parameters:
+    directoryPerms: "700"
+    uid: "999"
+    gid: "999"
     fileSystemId: <FILE_SYSTEM_ID>
+    provisioningMode: efs-ap
-...
-sharedData:
-  name: shared-data
-  storage: 50Gi
-
-logsPvc:
-  name: logs-pvc
-  storage: 10Gi
+clusterScope: true
 
-mysqlPvc:
-  name: mysql-pvc
-  storage: 2Gi
+pvc:
+  mysql: 2Gi
+  logs: 10Gi
+  data: 50Gi
 ```
 
-Add the FILE_SYSTEM_ID from your EFS setup. Adjust PVC sizes as needed.
-**Options:**
-- `create: true`: create a new storage class (overwrites existing data)
-- `create: false`: reuse an existing class (keeps data intact)
+Add the `FILE_SYSTEM_ID` from your EFS setup (step 4). Adjust PVC sizes as needed.
 
-#### Set Resource Limits at Pod/Container Level
+**StorageClass options:**
+- `create: true` — chart creates a release-unique storage class (e.g. `<release-name>-storage-class`). Each release gets its own.
+- `create: false` — reuse an existing class; `name` must match.
 
-Kubernetes schedules pods based on declared resource requests and enforces limits to prevent a single workload from monopolizing the cluster. If you do not set these, training jobs can consume all available CPU or memory, starving system components and other workloads and forcing jobs to restart. Setting GPU requests is equally important. Make sure the limits are within the AWS EC2 instance capacity you set previously. **IMPORTANT:** For the GPU setup, requests and limits must be the same. Kubernetes will reject configs where they differ.
+#### Set Resource Limits for Training Jobs
+
+Each training Job inherits these resource requests/limits. Make sure they fit within your EC2 instance capacity.
**IMPORTANT:** For GPU jobs, `requests` and `limits` must be equal — Kubernetes rejects configs where they differ. ```yaml +env: RESOURCE_REQUESTS: "cpu=2,memory=8Gi" RESOURCE_LIMITS: "cpu=2,memory=8Gi" - - GPU_REQUESTS: "" # for GPU support set "nvidia.com/gpu=1" - GPU_LIMITS: "" # for GPU support set ""nvidia.com/gpu=1" + GPU_REQUESTS: "" # for GPU support set "nvidia.com/gpu=1" + GPU_LIMITS: "" # for GPU support set "nvidia.com/gpu=1" ``` -These values define pod-level resource allocations. Kubernetes then places pods onto nodes (EC2s) that can satisfy them, based on current usage. - -**In short:** Set pod requests to what the training actually needs but keep limits well inside the EC2 capacity. - -**Tip:** Estimating VRAM requirements for LLMs can be tricky. Use the [VRAM Calculator](https://apxml.com/tools/vram-calculator) - to approximate memory needs for different model sizes and batch configurations. +**Tip:** Estimating VRAM requirements for LLMs can be tricky. Use the [VRAM Calculator](https://apxml.com/tools/vram-calculator) to approximate memory needs for different model sizes and batch configurations. **Node, Pod, Job relationship** -- One EC2 equals one node, one node runs many pods -- One pod contains one or more containers -- One training run equals one Job, which in most cases creates one pod by default - -Pods are lightweight, many pods can share one EC2. - +- One EC2 equals one node; one node runs many pods. +- One pod contains one or more containers. +- One training run equals one Job, which in most cases creates one pod by default. #### Proxy Settings (Optional) -These settings are only required if your EKS worker nodes need to reach the internet through a corporate or institutional proxy or firewall. Without them, the tracebloc client may fail to pull container images from Docker Hub or connect to the tracebloc backend. +Required only if your EKS worker nodes reach the internet through a corporate proxy: ```yaml - # proxy hostname. 
+env: HTTP_PROXY_HOST: - # proxy port. HTTP_PROXY_PORT: - # username used for proxy authentication if needed. HTTP_PROXY_USERNAME: - # password used for proxy authentication if needed. HTTP_PROXY_PASSWORD: ``` -If your cluster is deployed in the VPC with direct outbound internet access, you can leave these fields empty. In restricted environments, coordinate with your cloud or network team -### Optional: Repeated Setups +#### NetworkPolicy for Training Pods + +The chart can apply a NetworkPolicy that denies ingress and restricts egress to DNS + external HTTPS only — blocking pod-to-pod, MySQL, and Kubernetes API access from training pods. + +```yaml +networkPolicy: + training: + enabled: false # see note below +``` + +**EKS caveat:** the default AWS VPC CNI does **not** enforce NetworkPolicy. The CI defaults ship with `enabled: false` to avoid a false sense of security. If your cluster runs Calico or Cilium as an add-on, set `enabled: true`. + +#### Resource Monitor (metrics-server) + +The chart's resource-monitor DaemonSet polls `/apis/metrics.k8s.io/v1beta1` and **requires `metrics-server`**. EKS does not install it by default — install it once per cluster before deploying the chart: -Change the RBAC `clusterRole.create: name` in your `values.yaml` to a new name, e.g. your namespace. +```bash +kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml +``` + +If you cannot install metrics-server, disable the DaemonSet: +```yaml +resourceMonitor: false +``` ## 6. 
Client Deployment
 
@@ -788,21 +770,24 @@ With the configuration ready, deploy the tracebloc client into your EKS cluster
 
 ### Deploy the client with Helm
 
-Install the tracebloc Helm chart into the specified namespace and a suitable release name using your customized values file:
+Install the chart into your namespace using your customized values file:
 
 ```bash
-helm install <release-name> tracebloc/eks \
+helm install <release-name> tracebloc/client \
   --namespace <namespace> \
+  --create-namespace \
   --values values.yaml
 ```
+
 This creates:
 - One MySQL pod: `mysql-...`
-- One tracebloc client pod: `tracebloc-jobs-manager-...`
-- Supporting resources such as a Service, ConfigMap, Secret, and PVC
-
+- One tracebloc client pod: `<release-name>-jobs-manager-...`
+- A `tracebloc-resource-monitor` DaemonSet in the `tracebloc-node-agents` namespace
+- A `<release-name>-auto-upgrade` CronJob for daily chart upgrades (see [Configuration → Auto-upgrade](/environment-setup/configuration#auto-upgrade-on-by-default))
+- Supporting resources: Service, ConfigMap, Secrets, PVCs, PriorityClass, PDBs
 
-Expect a few minutes for all pods to pull images, create persistent volumes, and reach Running state.
+Expect a few minutes for pods to pull images, bind PVCs, and reach `Running`.
 
 ## Verification and Maintenance
 
@@ -831,16 +816,20 @@ kubectl get pvc -n <namespace>
 
 Verifies that PVCs are bound and storage is available.
 
 ### Maintenance
+
+The chart's auto-upgrade CronJob handles routine version bumps daily.
To upgrade manually:
+
 #### Update your values:
 
 ```bash
-helm show values tracebloc/eks > new-values.yaml
+helm show values tracebloc/client > new-values.yaml
 # Edit new-values.yaml with your changes
 ```
 
 #### Upgrade the deployment:
 
 ```bash
-helm upgrade tracebloc tracebloc/eks \
+helm upgrade <release-name> tracebloc/client \
   --namespace <namespace> \
+  --reset-then-reuse-values \
   --values new-values.yaml
 ```
 
diff --git a/environment-setup/setup-guide.mdx b/environment-setup/setup-guide.mdx
index c48e577..5006aff 100644
--- a/environment-setup/setup-guide.mdx
+++ b/environment-setup/setup-guide.mdx
@@ -59,13 +59,13 @@ Your client moves through these states as it goes from registration to running:
 
 ## 3. Deploy
 
-One command sets up your entire workspace. The installer is idempotent — it detects what's already installed and skips it, so it's safe to re-run at any time.
+One command sets up your entire workspace on any machine — macOS, Linux, or Windows. The installer is idempotent: it detects what's already installed and skips it, so it's safe to re-run at any time.
 
 ```bash
-bash <(curl -fsSL https://tracebloc.io/install.sh)
+bash <(curl -fsSL https://tracebloc.io/i.sh)
 ```
 
@@ -74,28 +74,46 @@ bash <(curl -fsSL https://tracebloc.io/install.sh)
 
 Open **PowerShell as Administrator**:
 
 ```powershell
-irm https://tracebloc.io/install.ps1 | iex
+irm https://tracebloc.io/i.ps1 | iex
 ```
 
-The installer prompts for three things:
+Nothing on your machine is modified outside of:
 
-1. **Workspace name** — a namespace for your Kubernetes deployment, e.g. `berlin-team`, `vision-lab`
-2. **Client ID** — from step 2
-3. **Client password** — from step 2
+- `~/.tracebloc/` — data and config
+- **Docker** — container runtime
 
 ### What the Installer Does
 
-Behind the scenes, the installer builds a complete local Kubernetes environment:
+The installer runs four clearly labelled steps:
 
-1. **Detects your system** — OS, architecture, and GPU hardware
-2.
**Installs dependencies** — Docker, k3d, kubectl, Helm (skips what's already present) -3. **Creates a Kubernetes cluster** — a lightweight k3s cluster running inside Docker, with all persistent data stored in `~/.tracebloc/` -4. **Deploys the tracebloc client** — via Helm chart, configured with your credentials +**Step 1/4 — Check system requirements** +Verifies Docker is installed and running, detects GPU hardware (falls back to CPU mode if none), and installs missing system tools (e.g. `conntrack`). -Install logs are saved to `~/.tracebloc/install-*.log` if you need to debug anything. +**Step 2/4 — Set up secure compute environment** +Provisions a lightweight local Kubernetes cluster inside Docker. First run takes 1–2 minutes to download components. + +**Step 3/4 — Install tracebloc client** +Prompts for a **workspace name** (e.g. `berlin-team`, `vision-lab`, `ml-mardan`). This identifies the client on your machine and becomes the Kubernetes namespace. + +**Step 4/4 — Connect to tracebloc network** +Prompts for your **Client ID** and **password** from step 2 above. This links your secure local environment to the tracebloc platform so vendors can submit models for evaluation. + +When it finishes you'll see a summary like: + +``` +tracebloc client installed successfully + +Workspace : +Mode : CPU # or GPU + +Logs: ~/.tracebloc/ +Data: /tracebloc/ +``` + +Install logs are kept in `~/.tracebloc/` if you need to debug anything. ### GPU Support