hyp create hyp-pytorch-job rejects configurations that specify both node_count and explicit resource fields (accelerators, vcpu, memory). The CLI raises:
❌ Either node-count OR a combination of accelerators, vcpu, memory-in-gib must be specified for instance-type ml.p4d.24xlarge
But both are needed simultaneously: node_count controls the number of replicas, while the resource fields control per-pod requests/limits. The underlying HyperPodPyTorchJob CRD supports both together.
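To make the expected behavior concrete, here is a minimal sketch contrasting the check the error message implies with a relaxed check that accepts both settings together. This is not the CLI's actual code: the `JobResources` class and both validator functions are hypothetical names used only for illustration, and the field names simply mirror the ones mentioned in this report.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobResources:
    # Hypothetical container; field names mirror those mentioned in this report.
    node_count: Optional[int] = None
    accelerators: Optional[int] = None
    vcpu: Optional[float] = None
    memory_in_gib: Optional[float] = None


def has_explicit_resources(r: JobResources) -> bool:
    """True if any per-pod resource field is set."""
    return any(v is not None for v in (r.accelerators, r.vcpu, r.memory_in_gib))


def validate_exclusive(r: JobResources) -> None:
    """Assumed shape of the current check: node_count and resources are mutually exclusive."""
    if r.node_count is not None and has_explicit_resources(r):
        raise ValueError(
            "Either node-count OR a combination of accelerators, vcpu, "
            "memory-in-gib must be specified"
        )


def validate_relaxed(r: JobResources) -> None:
    """Expected check: reject only when nothing at all is specified."""
    if r.node_count is None and not has_explicit_resources(r):
        raise ValueError("Specify node-count, explicit resources, or both")


# A two-node job with explicit per-pod requests (values are illustrative).
both = JobResources(node_count=2, accelerators=8, vcpu=90, memory_in_gib=1000)
validate_relaxed(both)  # accepted under the relaxed rule
```

With the relaxed rule, node_count would set the replica count and the resource fields would set per-pod requests/limits, which is the combination the CRD is reported to support.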
Without explicit resource requests, the operator auto-calculates them, and the result can exceed what is actually available on a node after system pod overhead, so Kueue never admits the job. This can make multi-node jobs with Kueue scheduling unusable through the CLI.
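The arithmetic behind the admission failure can be sketched as follows. This is a simplified model, not Kueue's or the operator's actual logic. The ml.p4d.24xlarge capacity (8 GPUs, 96 vCPUs, 1152 GiB) is the published instance spec. The system pod overhead, the assumption that the auto-calculated request equals full instance capacity, and the explicit request values are all hypothetical and chosen only to show the comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    accelerators: int
    vcpu: float
    memory_gib: float


def allocatable(capacity: Resources, overhead: Resources) -> Resources:
    """Capacity left on a node once system pods are accounted for."""
    return Resources(
        accelerators=capacity.accelerators - overhead.accelerators,
        vcpu=capacity.vcpu - overhead.vcpu,
        memory_gib=capacity.memory_gib - overhead.memory_gib,
    )


def fits(request: Resources, available: Resources) -> bool:
    """A request fits only if every dimension is within what is available."""
    return (
        request.accelerators <= available.accelerators
        and request.vcpu <= available.vcpu
        and request.memory_gib <= available.memory_gib
    )


# Published ml.p4d.24xlarge capacity.
capacity = Resources(accelerators=8, vcpu=96, memory_gib=1152)
# Hypothetical overhead from system pods.
overhead = Resources(accelerators=0, vcpu=2, memory_gib=16)
available = allocatable(capacity, overhead)

# Assumption: the auto-calculated request equals the full instance capacity.
auto_request = Resources(accelerators=8, vcpu=96, memory_gib=1152)
# Hypothetical explicit request that leaves headroom for system pods.
explicit_request = Resources(accelerators=8, vcpu=90, memory_gib=1000)

auto_fits = fits(auto_request, available)
explicit_fits = fits(explicit_request, available)
```

Under these assumptions the auto-calculated request does not fit while the explicit one does, which is why being able to set explicit requests alongside node_count matters for multi-node jobs.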
CLI version: v3.7.0