Cluster Management#

This guide will help you create and manage your first HyperPod cluster using the CLI.

Prerequisites#

Before you begin, ensure you have:

  • An AWS account with appropriate permissions for SageMaker HyperPod

  • AWS CLI configured with your credentials

  • HyperPod CLI installed (pip install sagemaker-hyperpod)

Note

Region Configuration: For commands that accept the --region option, if no region is explicitly provided, the command will use the default region from your AWS credentials configuration.

Cluster stack names must be unique within each AWS region. If you attempt to create a cluster stack with a name that already exists in the same region, the deployment will fail.
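The region fallback described in the note above can be approximated in a few lines. This is a minimal sketch, not the CLI's actual implementation: the lookup order (explicit value, then the AWS_DEFAULT_REGION environment variable, then the region key in ~/.aws/config) follows general AWS CLI conventions and is an assumption here, and resolve_region is a hypothetical helper name.

```python
import configparser
import os
from pathlib import Path
from typing import Optional


def resolve_region(explicit: Optional[str] = None,
                   config_path: Optional[Path] = None,
                   profile: str = "default") -> Optional[str]:
    """Guess which region a command would use (illustrative only)."""
    # 1. A value passed with --region always wins.
    if explicit:
        return explicit
    # 2. Environment variable commonly honored by AWS tooling (assumed here).
    env_region = os.environ.get("AWS_DEFAULT_REGION")
    if env_region:
        return env_region
    # 3. The `region` key of the profile in the shared AWS config file.
    path = config_path or Path.home() / ".aws" / "config"
    parser = configparser.ConfigParser()
    parser.read(path)  # a missing file is skipped without raising
    section = "default" if profile == "default" else f"profile {profile}"
    if parser.has_section(section):
        return parser.get(section, "region", fallback=None)
    return None
```

Calling resolve_region() before running a command is a quick way to confirm which region a stack name has to be unique in.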

Creating Your First Cluster#

1. Start with a Clean Directory#

Start each cluster configuration in a new, empty directory:

mkdir my-hyperpod-cluster
cd my-hyperpod-cluster

2. Initialize a New Cluster Configuration#

hyp init cluster-stack

This creates three files:

  • config.yaml: The main configuration file you’ll use to customize your cluster

  • cfn_params.jinja: A reference template for CloudFormation parameters

  • README.md: Usage guide with instructions and examples

Important

The resource_name_prefix parameter in the generated config.yaml file serves as the primary identifier for all AWS resources created during deployment. Each deployment must use a unique resource name prefix to avoid conflicts. During cluster creation, a unique identifier is automatically appended to this prefix to ensure resource uniqueness.

3. Configure Your Cluster#

You can configure your cluster in two ways:

Option 1: Edit config.yaml directly

The config.yaml file contains key parameters like:

template: cluster-stack
namespace: kube-system
stage: gamma
resource_name_prefix: sagemaker-hyperpod-eks
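To check these values from a script before deploying, the top-level keys can be read with the standard library. This is a rough sketch that assumes the keys of interest are unindented key: value lines, as in the excerpt above. It is not a YAML parser, so a real YAML library is the safer choice for anything nested. read_top_level_keys is a hypothetical helper name.

```python
from typing import Dict


def read_top_level_keys(text: str) -> Dict[str, str]:
    """Collect unindented `key: value` pairs, skipping comments and nesting."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        # Skip blank lines, indented (nested) lines, comments, and list items.
        if not line or line[0] in " \t#-" or ":" not in line:
            continue
        key, _, value = line.partition(":")
        # Drop a trailing comment and any surrounding quotes.
        value = value.split(" #", 1)[0].strip().strip("'\"")
        values[key.strip()] = value
    return values


# The excerpt shown above, used as sample input.
sample = """\
template: cluster-stack
namespace: kube-system
stage: gamma
resource_name_prefix: sagemaker-hyperpod-eks
"""
config = read_top_level_keys(sample)
print(config["resource_name_prefix"])
```

To inspect the real file, pass the contents of config.yaml (for example, Path("config.yaml").read_text()) instead of the sample string.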

Option 2: Use CLI/SDK commands (Pre-Deployment)

hyp configure --resource-name-prefix your-resource-prefix

Note

The hyp configure command only modifies local configuration files. It does not affect existing deployed clusters.

4. Create the Cluster#

Warning

Cluster Stack Name Uniqueness: Cluster stack names must be unique within each AWS region. Ensure your resource_name_prefix in config.yaml generates a unique stack name for the target region to avoid deployment conflicts.

hyp create --region your-region

This will:

  • Validate your configuration

  • Create a timestamped folder in the run directory

  • Initialize the cluster creation process
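When creation is driven from a script, validation and creation can be chained so that nothing is submitted when validation fails. This is a minimal sketch that uses only the hyp validate and hyp create commands shown in this guide. It assumes the CLI exits with a non-zero status on failure, and the function names are hypothetical.

```python
import subprocess
from typing import List, Optional


def build_create_command(region: Optional[str] = None) -> List[str]:
    """Build the argument list for `hyp create`, adding --region only if given."""
    cmd = ["hyp", "create"]
    if region:
        cmd += ["--region", region]
    return cmd


def validate_and_create(region: Optional[str] = None) -> None:
    """Run `hyp validate`, then `hyp create`, from the current directory."""
    # check=True raises CalledProcessError on a non-zero exit status, so the
    # create step is skipped when validation fails (assumed CLI behavior).
    subprocess.run(["hyp", "validate"], check=True)
    subprocess.run(build_create_command(region), check=True)


# Example (run from the directory that holds config.yaml):
# validate_and_create("your-region")
```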

5. Monitor Your Cluster#

Check the status of your cluster:

hyp describe cluster-stack your-cluster-name --region your-region

Using the SDK:

from sagemaker.hyperpod.cluster_management.hp_cluster_stack import HpClusterStack

# Describe a specific cluster stack
response = HpClusterStack.describe("your-cluster-name", region="your-region")
print(f"Stack Status: {response['Stacks'][0]['StackStatus']}")
print(f"Stack Name: {response['Stacks'][0]['StackName']}")

Note

Region-Specific Stack Names: Cluster stack names are unique within each AWS region. When describing a stack, ensure you specify the correct region where the stack was created, or the command will fail to find the stack.
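Because stack creation takes time, a script may want to poll the status until it settles. The sketch below assumes the response shape shown in the SDK example above (response['Stacks'][0]['StackStatus']) and treats any CloudFormation status that does not end in _IN_PROGRESS as final. wait_for_stack is a hypothetical helper. It takes the describe function as an argument so it can be tried without an AWS account.

```python
import time
from typing import Any, Callable, Dict


def wait_for_stack(describe: Callable[..., Dict[str, Any]],
                   stack_name: str,
                   region: str,
                   poll_seconds: float = 30.0,
                   timeout_seconds: float = 3600.0,
                   sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll `describe` until the stack leaves every *_IN_PROGRESS state."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = describe(stack_name, region=region)
        status = response["Stacks"][0]["StackStatus"]
        # CloudFormation transitional states all end in _IN_PROGRESS.
        if not status.endswith("_IN_PROGRESS"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{stack_name} is still {status}")
        sleep(poll_seconds)


# Example with the SDK class used in this guide:
# final = wait_for_stack(HpClusterStack.describe, "your-cluster-name", "your-region")
```

The returned status still needs checking: CREATE_COMPLETE indicates success, while values such as CREATE_FAILED or ROLLBACK_COMPLETE indicate that the deployment did not succeed.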

List all clusters:

hyp list cluster-stack --region your-region

Using the SDK:

from sagemaker.hyperpod.cluster_management.hp_cluster_stack import HpClusterStack

# List all CloudFormation stacks (including cluster stacks)
stacks = HpClusterStack.list(region="your-region")
for stack in stacks['StackSummaries']:
    print(f"Stack: {stack['StackName']}, Status: {stack['StackStatus']}")
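Since the listing includes every CloudFormation stack in the region, it can help to narrow it to stacks that share your resource prefix. This is a small sketch under the assumption that the stack name begins with the prefix. It relies only on the StackSummaries, StackName, and StackStatus keys shown above. stacks_with_prefix is a hypothetical helper, and skipping DELETE_COMPLETE entries is a precaution in case deleted stacks are returned.

```python
from typing import Any, Dict, List


def stacks_with_prefix(stacks: Dict[str, Any], prefix: str) -> List[Dict[str, Any]]:
    """Return summaries whose name starts with `prefix`, skipping deleted stacks."""
    return [
        summary
        for summary in stacks.get("StackSummaries", [])
        if summary["StackName"].startswith(prefix)
        and summary["StackStatus"] != "DELETE_COMPLETE"
    ]


# Hand-written sample data in the same shape as the listing above.
sample = {
    "StackSummaries": [
        {"StackName": "sagemaker-hyperpod-eks-a1", "StackStatus": "CREATE_COMPLETE"},
        {"StackName": "sagemaker-hyperpod-eks-b2", "StackStatus": "DELETE_COMPLETE"},
        {"StackName": "unrelated-stack", "StackStatus": "CREATE_COMPLETE"},
    ]
}
for summary in stacks_with_prefix(sample, "sagemaker-hyperpod-eks"):
    print(summary["StackName"], summary["StackStatus"])
```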

Common Operations#

Update a Cluster#

Important

Runtime vs Configuration Commands:

  • hyp update cluster modifies existing, deployed clusters (runtime settings like instance groups, node recovery)

  • hyp configure modifies local config.yaml files before cluster creation

Use the appropriate command based on whether your cluster is already deployed or not.

hyp update cluster \
    --cluster-name your-cluster-name \
    --instance-groups "[]" \
    --region your-region
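When the update is issued from a script, serializing the instance group list with json.dumps avoids hand-quoting mistakes. This sketch assumes that --instance-groups takes a JSON array, which is consistent with the "[]" in the command above. The fields inside each group are not covered in this guide, so the example passes an empty list. build_update_command is a hypothetical helper.

```python
import json
import shlex
from typing import Any, Dict, List


def build_update_command(cluster_name: str,
                         instance_groups: List[Dict[str, Any]],
                         region: str) -> List[str]:
    """Build the argument list for `hyp update cluster`."""
    return [
        "hyp", "update", "cluster",
        "--cluster-name", cluster_name,
        # Serialize to JSON so quoting and escaping are handled consistently.
        "--instance-groups", json.dumps(instance_groups),
        "--region", region,
    ]


cmd = build_update_command("your-cluster-name", [], "your-region")
# shlex.join renders the list as a single shell-safe command line.
print(shlex.join(cmd))
```

Passing the list directly to subprocess.run(cmd, check=True) sidesteps shell quoting altogether.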

Reset Configuration#

hyp reset

Best Practices#

  • Always validate your configuration before submission:

    hyp validate
    

    Note

    This command validates only the syntax of the config.yaml file against the appropriate schema. It checks:

    • YAML syntax: Ensures file is valid YAML

    • Required fields: Verifies all mandatory fields are present

    • Data types: Confirms field values match expected types (string, number, boolean, array)

    • Schema structure: Validates against the template’s defined structure

    The command does not verify that the values themselves are valid (e.g., whether AWS regions exist, instance types are available, or resources can be created).

  • Use meaningful resource prefixes to easily identify your clusters

  • Monitor cluster status regularly after creation

  • Keep your configuration files in version control for reproducibility
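To make the limits of syntactic validation concrete, the sketch below checks required fields and data types against a small schema. It is not how hyp validate is implemented. The schema is hypothetical and covers only the keys shown in the config.yaml excerpt in this guide.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical schema: field name -> (expected type, required?)
SCHEMA: Dict[str, Tuple[type, bool]] = {
    "template": (str, True),
    "namespace": (str, True),
    "stage": (str, False),
    "resource_name_prefix": (str, True),
}


def syntactic_errors(config: Dict[str, Any],
                     schema: Dict[str, Tuple[type, bool]] = SCHEMA) -> List[str]:
    """Report missing required fields, wrong types, and unknown fields."""
    errors: List[str] = []
    for field, (expected, required) in schema.items():
        if field not in config:
            if required:
                errors.append(f"missing required field: {field}")
            continue
        if not isinstance(config[field], expected):
            errors.append(f"{field}: expected {expected.__name__}")
    for field in config:
        if field not in schema:
            errors.append(f"unknown field: {field}")
    return errors


# Well-formed but meaningless values pass, because only structure is checked.
print(syntactic_errors({
    "template": "cluster-stack",
    "namespace": "kube-system",
    "stage": "not-a-real-stage",
    "resource_name_prefix": "demo",
}))
```

The example configuration produces no errors even though its stage value is made up, which is why a configuration that validates can still fail at deployment time.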

Next Steps#

After creating your cluster, you can:

  • Connect to your cluster:

    hyp set-cluster-context --cluster-name your-cluster-name
    
  • Start training jobs with PyTorch

  • Deploy inference endpoints

  • Monitor cluster resources and performance

For more detailed information on specific commands, use the --help flag:

hyp <command> --help

Cluster Management Example Notebooks#

For detailed examples of training with HyperPod, see:

These examples demonstrate end-to-end workflows for creating and managing cluster stacks using both the CLI and SDK approaches.