Cluster Management#
Complete reference for SageMaker HyperPod cluster management parameters and configuration options.
Note
Region Configuration: For commands that accept the --region option, if no region is explicitly provided, the command will use the default region from your AWS credentials configuration.
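To check which region a command would fall back to, a minimal sketch along these lines may help. It assumes the AWS CLI is installed and that the fallback matches what `aws configure get region` reports. `resolve_region` is a hypothetical helper, not part of the hyp CLI, and the CLI's actual lookup may also consult environment variables.

```shell
# Hypothetical helper: print the explicit region if one is given,
# otherwise the default region reported by the AWS CLI (if installed).
resolve_region() {
  if [ -n "${1:-}" ]; then
    printf '%s\n' "$1"
  elif command -v aws >/dev/null 2>&1; then
    aws configure get region
  else
    return 1
  fi
}

# An explicit value wins; no AWS call is made in this case.
resolve_region us-west-2
```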
hyp init#
Initialize a template scaffold in the current directory.
Syntax#
hyp init TEMPLATE [DIRECTORY] [OPTIONS]
Parameters#
| Parameter | Type | Required | Description |
|---|---|---|---|
| TEMPLATE | CHOICE | Yes | Template type (cluster-stack, hyp-pytorch-job, hyp-custom-endpoint, hyp-jumpstart-endpoint) |
| DIRECTORY | PATH | No | Target directory (default: current directory) |
| | TEXT | No | Schema version to use |
Important
The resource_name_prefix parameter in the generated config.yaml file is the primary identifier for all AWS resources created during deployment. Each deployment must use a unique prefix to avoid conflicts. During cluster creation, a unique identifier is automatically appended to the prefix to keep resource names unique.
Cluster stack names must be unique within each AWS region. If you attempt to create a cluster stack with a name that already exists in the same region, the deployment will fail.
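One way to avoid reusing a prefix across deployments is to derive it from a base name plus a timestamp. The sketch below is illustrative only: `make_prefix` is a hypothetical helper, the timestamp format is an arbitrary choice, and any length or character limits on the prefix are not covered here.

```shell
# Hypothetical helper: append a UTC timestamp (MMDDhhmm) to a base prefix
# so that repeated deployments do not reuse the same resource_name_prefix.
make_prefix() {
  printf '%s-%s\n' "$1" "$(date -u +%m%d%H%M)"
}

PREFIX="$(make_prefix my-cluster)"
# Print the command rather than running it, so the sketch is safe to try.
echo "hyp configure --resource-name-prefix $PREFIX"
```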
hyp configure#
Configure cluster parameters interactively or via command line.
Important
Pre-Deployment Configuration: This command modifies the local config.yaml file before cluster creation. To update an existing, deployed cluster, use hyp update cluster instead.
Syntax#
hyp configure [OPTIONS]
Parameters#
This command dynamically supports all configuration parameters available in the current template’s schema. Common parameters include:
| Parameter | Type | Required | Description |
|---|---|---|---|
| --resource-name-prefix | TEXT | No | Prefix for all AWS resources |
| | BOOLEAN | No | Create HyperPod Cluster Stack |
| | TEXT | No | Name of SageMaker HyperPod Cluster |
| | BOOLEAN | No | Create EKS Cluster Stack |
| | TEXT | No | Kubernetes version |
| | TEXT | No | Name of the EKS cluster |
| | BOOLEAN | No | Create Helm Chart Stack |
| | TEXT | No | Namespace to deploy HyperPod Helm chart |
| | TEXT | No | Continuous provisioning mode |
| | TEXT | No | Node recovery setting (“Automatic” or “None”) |
| | BOOLEAN | No | Create VPC Stack |
| | TEXT | No | Existing VPC ID |
| | TEXT | No | VPC CIDR block |
| | BOOLEAN | No | Create Security Group Stack |
| | BOOLEAN | No | Enable inference operator |
| --stage | TEXT | No | Deployment stage (“gamma” or “prod”) |
| | BOOLEAN | No | Create FSx Stack |
| | INTEGER | No | FSx storage capacity in GiB |
| | JSON | No | Resource tags as JSON object |
Note: The exact parameters available depend on your current template type and version. Run hyp configure --help to see all available options for your specific configuration.
hyp validate#
Validate the current directory’s configuration file syntax and structure.
Syntax#
# Validate current configuration syntax
hyp validate
# Example output on success
✔️ config.yaml is valid!
# Example output with syntax errors
❌ Config validation errors:
– kubernetes_version: Field is required
– vpc_cidr: Expected string, got number
Parameters#
No parameters required.
Note
This command validates only the syntax and structure of the config.yaml file against the template’s schema. It checks:
YAML syntax: Ensures the file is valid YAML
Required fields: Verifies all mandatory fields are present
Data types: Confirms field values match expected types (string, number, boolean, array)
Schema structure: Validates against the template’s defined structure
It does not verify that the values themselves are usable (e.g., whether AWS regions exist, instance types are available, or resources can be created).
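Because hyp validate stops at syntax and types, value-level checks are left to you. As an illustrative sketch (not part of the hyp CLI), the hypothetical helper below checks that a value at least has the shape of an IPv4 CIDR block before you put it in vpc_cidr. It checks shape only and does not reject octets above 255.

```shell
# Hypothetical pre-flight check: succeed if $1 looks like an IPv4 CIDR block.
# Shape only; octets above 255 are not rejected.
looks_like_cidr() {
  printf '%s\n' "$1" |
    grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}/([0-9]|[12][0-9]|3[0-2])$'
}

if looks_like_cidr "10.192.0.0/16"; then
  echo "vpc_cidr shape OK"
fi
```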
Prerequisites
Must be run in a directory where hyp init has created configuration files
A config.yaml file must exist in the current directory
Output
Success: Displays confirmation message if syntax is valid
Errors: Lists specific syntax errors with field names and descriptions
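In a script, validation can gate creation. The sketch below assumes that hyp validate returns a non-zero exit status when it reports errors, which this reference does not state. The `HYP` variable and `validate_then_create` function are hypothetical names introduced for illustration.

```shell
# HYP can be overridden (e.g. with a stub) for dry runs; defaults to the real CLI.
HYP="${HYP:-hyp}"

# Run `hyp create` only if `hyp validate` succeeds.
# Assumes validate returns a non-zero exit status when it reports errors.
validate_then_create() {
  region="$1"
  if "$HYP" validate; then
    "$HYP" create --region "$region"
  else
    echo "Validation failed; skipping create." >&2
    return 1
  fi
}
```

Call it from the directory that holds config.yaml, for example `validate_then_create us-west-2`.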
hyp reset#
Reset the current directory’s config.yaml to default values.
Syntax#
hyp reset
Parameters#
No parameters required.
hyp create#
Create a new HyperPod cluster stack using the provided configuration.
Syntax#
hyp create [OPTIONS]
Parameters#
| Parameter | Type | Required | Description |
|---|---|---|---|
| --region | TEXT | No | AWS region where the cluster stack will be created |
| | FLAG | No | Enable debug logging |
hyp update cluster#
Update an existing HyperPod cluster configuration.
Important
Runtime vs Configuration Commands: This command modifies an existing, deployed cluster’s runtime settings (instance groups, node recovery). This is different from hyp configure, which only modifies local configuration files before cluster creation.
Syntax#
hyp update cluster [OPTIONS]
Parameters#
| Parameter | Type | Required | Description |
|---|---|---|---|
| --cluster-name | TEXT | Yes | Name of the cluster to update |
| --instance-groups | TEXT | No | JSON string of instance group configurations |
| | TEXT | No | JSON string of instance groups to delete |
| --region | TEXT | No | AWS region of the cluster |
| | TEXT | No | Node recovery setting (Automatic or None) |
| | FLAG | No | Enable debug logging |
hyp list cluster-stack#
List all HyperPod cluster stacks (CloudFormation stacks).
Syntax#
hyp list cluster-stack [OPTIONS]
Parameters#
| Parameter | Type | Required | Description |
|---|---|---|---|
| --region | TEXT | No | AWS region to list stacks from |
| | TEXT | No | Filter by stack status. Format: "['CREATE_COMPLETE', 'UPDATE_COMPLETE']" |
| | FLAG | No | Enable debug logging |
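The status filter expects a bracketed, single-quoted list passed as one string. A small hypothetical helper can build that string from individual status names. The name of the option that takes this value is not shown here, so check hyp list cluster-stack --help before using it.

```shell
# Hypothetical helper: join status names into the documented filter format,
# e.g. ['CREATE_COMPLETE', 'UPDATE_COMPLETE'].
status_filter() {
  out=""
  for s in "$@"; do
    if [ -n "$out" ]; then
      out="$out, "
    fi
    out="$out'$s'"
  done
  printf '[%s]\n' "$out"
}

status_filter CREATE_COMPLETE UPDATE_COMPLETE
```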
hyp describe cluster-stack#
Describe a specific HyperPod cluster stack.
Note
Region-Specific Stack Names: Cluster stack names are unique within each AWS region. When describing a stack, ensure you specify the correct region where the stack was created, or the command will fail to find the stack.
Syntax#
hyp describe cluster-stack STACK-NAME [OPTIONS]
Parameters#
| Parameter | Type | Required | Description |
|---|---|---|---|
| STACK-NAME | TEXT | Yes | Name of the CloudFormation stack to describe |
| --region | TEXT | No | AWS region of the stack |
| | FLAG | No | Enable debug logging |
hyp delete cluster-stack#
Delete a HyperPod cluster stack. Removes the specified CloudFormation stack and all associated AWS resources. This operation cannot be undone.
Syntax#
hyp delete cluster-stack <stack-name> [OPTIONS]
Parameters#
| Option | Type | Description |
|---|---|---|
| --region | Required | The AWS region where the stack exists. |
| | Optional | Comma-separated list of logical resource IDs to retain during deletion (only works on DELETE_FAILED stacks). Resource names are shown in failed deletion output, or use AWS CLI: |
| | Optional | Enable debug mode for detailed logging. |
hyp list-cluster#
List SageMaker HyperPod clusters with capacity information.
Syntax#
hyp list-cluster [OPTIONS]
Parameters#
| Parameter | Type | Required | Description |
|---|---|---|---|
| --region | TEXT | No | AWS region to list clusters from |
| --output | TEXT | No | Output format (“table” or “json”, default: “json”) |
| | TEXT | No | Comma-separated list of specific cluster names |
| | TEXT | No | Namespace to check capacity for (can be used multiple times) |
| | FLAG | No | Enable debug logging |
hyp set-cluster-context#
Connect to a HyperPod EKS cluster and set kubectl context.
Syntax#
hyp set-cluster-context [OPTIONS]
Parameters#
| Parameter | Type | Required | Description |
|---|---|---|---|
| --cluster-name | TEXT | Yes | Name of the HyperPod cluster to connect to |
| --region | TEXT | No | AWS region of the cluster |
| | TEXT | No | Kubernetes namespace to connect to |
| | FLAG | No | Enable debug logging |
hyp get-cluster-context#
Get context information for the currently connected cluster.
Syntax#
hyp get-cluster-context [OPTIONS]
Parameters#
| Parameter | Type | Required | Description |
|---|---|---|---|
| | FLAG | No | Enable debug logging |
hyp get-monitoring#
Get monitoring configurations for the HyperPod cluster.
Syntax#
hyp get-monitoring [OPTIONS]
Parameters#
| Parameter | Type | Required | Description |
|---|---|---|---|
| | FLAG | No | Return Grafana dashboard URL |
| | FLAG | No | Return Prometheus workspace URL |
| | FLAG | No | Return list of available metrics |
Parameter Reference#
Common Parameters Across Commands#
| Parameter | Type | Description | Default |
|---|---|---|---|
| --region | TEXT | AWS region | Current AWS profile region |
| --help | FLAG | Show command help | - |
| | FLAG | Enable verbose output | false |
Configuration File Parameters#
The config.yaml file supports the following parameters:
| Parameter | Type | Description | Default |
|---|---|---|---|
| resource_name_prefix | TEXT | Prefix for all AWS resources (4-digit UUID added during submission) | “hyp-eks-stack” |
| | BOOLEAN | Create HyperPod Cluster Stack | true |
| | TEXT | Name of SageMaker HyperPod Cluster | “hyperpod-cluster” |
| | BOOLEAN | Create EKS Cluster Stack | true |
| kubernetes_version | TEXT | Kubernetes version | “1.31” |
| | TEXT | Name of the EKS cluster | “eks-cluster” |
| | BOOLEAN | Create Helm Chart Stack | true |
| | TEXT | Namespace to deploy HyperPod Helm chart | “kube-system” |
| | TEXT | URL of Helm repo containing HyperPod Helm chart | “https://github.com/aws/sagemaker-hyperpod-cli.git” |
| | TEXT | Path to HyperPod Helm chart in repo | “helm_chart/HyperPodHelmChart” |
| | TEXT | Configuration of HyperPod Helm chart | “mlflow.enabled=true,trainingOperators.enabled=true,…” |
| | TEXT | Name for Helm chart release | “dependencies” |
| | TEXT | Continuous provisioning mode (“Continuous” or empty) | “Continuous” |
| | TEXT | Automatic node recovery (“Automatic” or “None”) | “Automatic” |
| | ARRAY | List of instance group configurations | [Default controller group] |
| | ARRAY | Restricted instance group configurations | null |
| | TEXT | S3 bucket for RIG resources | null |
| | ARRAY | Custom tags for SageMaker HyperPod cluster | null |
| | BOOLEAN | Create VPC Stack | true |
| | TEXT | Existing VPC ID (if not creating new) | null |
| vpc_cidr | TEXT | IP range for VPC | “10.192.0.0/16” |
| | ARRAY | List of AZs to deploy subnets | null |
| | BOOLEAN | Create Security Group Stack | true |
| | TEXT | Existing security group ID | null |
| | ARRAY | Security groups for HyperPod cluster | null |
| | ARRAY | Private subnet IDs for HyperPod cluster | null |
| | ARRAY | Private subnet IDs for EKS cluster | null |
| | ARRAY | NAT Gateway IDs for internet routing | null |
| | ARRAY | Private route table IDs | null |
| | BOOLEAN | Create S3 Endpoint stack | true |
| | BOOLEAN | Enable inference operator | false |
| | TEXT | Deployment stage (“gamma” or “prod”) | “prod” |
| | TEXT | Custom S3 bucket name for templates | “” |
| | BOOLEAN | Create Life Cycle Script Stack | true |
| | BOOLEAN | Create S3 Bucket Stack | true |
| | TEXT | S3 bucket for cluster lifecycle scripts | “s3-bucket” |
| | TEXT | Raw GitHub URL for lifecycle script | “https://raw.githubusercontent.com/aws-samples/…” |
| | TEXT | File name of lifecycle script | “sagemaker-hyperpod-eks-bucket” |
| | BOOLEAN | Create SageMaker IAM Role Stack | true |
| | TEXT | IAM role name for SageMaker cluster creation | “create-cluster-role” |
| | BOOLEAN | Create FSx Stack | true |
| | TEXT | Subnet ID for FSx creation | “” |
| | TEXT | Availability zone for FSx subnet | “” |
| | INTEGER | Per unit storage throughput | 250 |
| | TEXT | Data compression type (“NONE” or “LZ4”) | “NONE” |
| | FLOAT | File system type version | 2.15 |
| | INTEGER | Storage capacity in GiB | 1200 |
| | TEXT | Existing FSx file system ID | “” |
Note: The actual available configuration parameters depend on the specific template schema version. Use hyp init cluster-stack to see all available parameters for your version.
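TEXT parameters whose values look numeric, such as a Kubernetes version of 1.31, are read by YAML parsers as numbers unless quoted, which may trigger an "Expected string, got number" validation error. The sketch below flags such a value. `has_bare_number` is a hypothetical helper, and it assumes the key sits at the top level of config.yaml, one key per line.

```shell
# Hypothetical check: succeed if top-level key $2 in file $1 has a bare
# (unquoted) numeric value, which YAML parsers read as a number.
has_bare_number() {
  grep -Eq "^$2:[[:space:]]*[0-9]+(\.[0-9]+)?[[:space:]]*\$" "$1"
}

# Demonstration on a throwaway file rather than a real config.yaml.
tmp="$(mktemp)"
printf 'kubernetes_version: 1.31\n' > "$tmp"
if has_bare_number "$tmp" kubernetes_version; then
  echo 'kubernetes_version is unquoted; write "1.31" instead'
fi
rm -f "$tmp"
```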
Examples#
Basic Cluster Stack Creation#
# Start with a clean directory
mkdir my-hyperpod-cluster
cd my-hyperpod-cluster
# Initialize cluster configuration
hyp init cluster-stack
# Configure basic parameters
hyp configure --resource-name-prefix my-cluster --stage prod
# Validate configuration
hyp validate
# Create cluster stack
hyp create --region us-west-2
Update Existing Cluster#
# Update instance groups
hyp update cluster \
--cluster-name my-cluster \
--instance-groups '[{"InstanceCount":2,"InstanceGroupName":"worker-nodes","InstanceType":"ml.m5.large"}]' \
--region us-west-2
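Hand-writing the JSON string for --instance-groups is error-prone, so a small helper can assemble it. This is a sketch only: `instance_group_json` is a hypothetical function, it emits just the three keys used in the example above, and it does not escape special characters in its arguments.

```shell
# Hypothetical helper: build a one-element instance group list.
# $1 = group name, $2 = instance type, $3 = instance count
instance_group_json() {
  printf '[{"InstanceCount":%d,"InstanceGroupName":"%s","InstanceType":"%s"}]\n' \
    "$3" "$1" "$2"
}

IG_JSON="$(instance_group_json worker-nodes ml.m5.large 2)"
# Print the command rather than running it, so the sketch is safe to try.
echo "hyp update cluster --cluster-name my-cluster --instance-groups '$IG_JSON' --region us-west-2"
```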
List and Describe#
# List all cluster stacks
hyp list cluster-stack --region us-west-2
# Describe specific cluster stack
hyp describe cluster-stack my-stack-name --region us-west-2
# List HyperPod clusters with capacity info
hyp list-cluster --region us-west-2 --output table
# Connect to cluster
hyp set-cluster-context --cluster-name my-cluster --region us-west-2
# Get current context
hyp get-cluster-context