Cluster Management

class sagemaker.hyperpod.cluster_management.hp_cluster_stack.HpClusterStack

Manages SageMaker HyperPod cluster CloudFormation stacks.

This class provides functionality to create, manage, and monitor CloudFormation stacks for SageMaker HyperPod clusters. It extends ClusterStackBase with stack lifecycle operations.

Usage Examples
>>> # Create a cluster stack instance
>>> stack = HpClusterStack()
>>> response = stack.create(region="us-west-2")
>>>
>>> # Check stack status
>>> status = stack.get_status()
>>> print(status)
create(region: str | None = None, template_version: int | None = 1) -> str

Creates a new HyperPod cluster CloudFormation stack.

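Parameters:

| Parameter | Type | Description |
|---|---|---|
| region | str, optional | AWS region for stack creation. Uses the current session region if not specified |
| template_version | int, optional | Template version to use. Defaults to 1 |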

Returns:

dict: CloudFormation describe_stacks response containing stack details

Raises:

Exception: When CloudFormation stack creation fails

Usage Examples
>>> # Create stack in default region
>>> stack = HpClusterStack()
>>> response = stack.create()
>>>
>>> # Create stack in specific region
>>> response = stack.create(region="us-east-1")
static describe(stack_name, region: str | None = None)

Describes a CloudFormation stack by name.

Note

Stack descriptions are region-specific. You must use the correct region where the stack was created to retrieve its description.

Parameters:

| Parameter | Type | Description |
|---|---|---|
| stack_name | str | Name of the CloudFormation stack to describe. For an ARN of the form arn:aws:cloudformation:region:account:stack/stack-name/stack-id, use the stack-name part |
| region | str, optional | AWS region where the stack exists |

Returns:

dict: CloudFormation describe_stacks response

Raises:

ValueError: When the stack is not accessible or doesn’t exist

RuntimeError: When the CloudFormation operation fails

Usage Examples
>>> # Describe a stack by name
>>> response = HpClusterStack.describe("my-stack-name")
>>>
>>> # Describe stack in specific region
>>> response = HpClusterStack.describe("my-stack", region="us-west-2")
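
The stack_name parameter expects the plain stack name, not the full ARN. The following is a minimal sketch of a helper that accepts either form and splits an ARN into the stack name and region; parse_stack_identifier is a hypothetical name and is not part of HpClusterStack.

```python
from typing import Optional, Tuple


def parse_stack_identifier(identifier: str) -> Tuple[str, Optional[str]]:
    """Return (stack_name, region) for a plain stack name or a stack ARN.

    Hypothetical helper, not part of the library. For a plain name the
    region is unknown, so None is returned in its place.
    """
    if not identifier.startswith("arn:"):
        return identifier, None

    # ARN layout: arn:aws:cloudformation:region:account:stack/stack-name/stack-id
    parts = identifier.split(":", 5)
    if len(parts) != 6 or not parts[5].startswith("stack/"):
        raise ValueError(f"Not a CloudFormation stack ARN: {identifier!r}")

    segments = parts[5].split("/")
    if len(segments) < 2 or not segments[1]:
        raise ValueError(f"Stack ARN has no stack name: {identifier!r}")

    return segments[1], parts[3]
```

The two returned values can then be passed on, for example as HpClusterStack.describe(name, region=region).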
static list(region: str | None = None, stack_status_filter: List[str] | None = None)

Lists all CloudFormation stacks in the specified region.

Note

Stack listings are region-specific. If no region is provided, uses the default region from your AWS configuration.

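Parameters:

| Parameter | Type | Description |
|---|---|---|
| region | str, optional | AWS region to list stacks from. Uses the default region if not specified |
| stack_status_filter | List[str], optional | CloudFormation stack statuses to filter the listing by |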

Returns:

dict: CloudFormation list_stacks response containing stack summaries

Raises:

ValueError: When permissions are insufficient to list stacks

RuntimeError: When the CloudFormation list operation fails

Usage Examples
>>> # List stacks in current region
>>> stacks = HpClusterStack.list()
>>>
>>> # List stacks in specific region
>>> stacks = HpClusterStack.list(region="us-east-1")
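
As a sketch of how the returned dictionary might be consumed, the helper below groups stack names by status. It assumes the return value has the shape of a CloudFormation list_stacks response, with a top-level "StackSummaries" list whose entries carry "StackName" and "StackStatus"; group_stacks_by_status is a hypothetical name.

```python
from typing import Any, Dict, List


def group_stacks_by_status(response: Dict[str, Any]) -> Dict[str, List[str]]:
    """Map each stack status to the names of the stacks in that status.

    Assumes `response` follows the CloudFormation list_stacks response
    shape; a missing "StackSummaries" key is treated as an empty listing.
    """
    grouped: Dict[str, List[str]] = {}
    for summary in response.get("StackSummaries", []):
        grouped.setdefault(summary["StackStatus"], []).append(summary["StackName"])
    return grouped
```

For example, group_stacks_by_status(HpClusterStack.list()) would give a quick overview of which stacks are complete, in progress, or failed.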
get_status(region: str | None = None)

Gets the status of the current stack instance.

Parameters:

| Parameter | Type | Description |
|---|---|---|
| region | str, optional | AWS region where the stack exists |

Returns:

str: CloudFormation stack status (e.g., ‘CREATE_COMPLETE’, ‘UPDATE_IN_PROGRESS’)

Raises:

ValueError: When stack hasn’t been created yet (call create() first)

Usage Examples
>>> # Create stack first, then check status
>>> stack = HpClusterStack()
>>> stack.create()
>>> status = stack.get_status()
>>> print(f"Stack status: {status}")
static check_status(stack_name: str, region: str | None = None)

Checks the status of any CloudFormation stack by name.

Parameters:

| Parameter | Type | Description |
|---|---|---|
| stack_name | str | Name of the CloudFormation stack |
| region | str, optional | AWS region where the stack exists |

Returns:

str: CloudFormation stack status or None if stack not found

Usage Examples
>>> # Check status of any stack
>>> status = HpClusterStack.check_status("my-stack-name")
>>>
>>> # Check status in specific region
>>> status = HpClusterStack.check_status("my-stack", region="us-west-2")
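
Because check_status returns a single snapshot, waiting for a stack to settle requires polling. The sketch below is one way to do that, under the assumption that every transitional CloudFormation status ends in "_IN_PROGRESS". wait_for_stack and its parameters are hypothetical; the status function is passed in so the helper stays independent of the library.

```python
import time
from typing import Callable, Optional


def wait_for_stack(
    check_status: Callable[..., Optional[str]],
    stack_name: str,
    region: Optional[str] = None,
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll until the stack leaves every *_IN_PROGRESS status.

    Returns the final status, or None if the stack is not found.
    Raises TimeoutError if the stack is still in progress at the deadline.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = check_status(stack_name, region=region)
        if status is None or not status.endswith("_IN_PROGRESS"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Stack {stack_name!r} still in {status} after timeout")
        sleep(poll_seconds)
```

A call such as wait_for_stack(HpClusterStack.check_status, "my-stack", region="us-west-2") would block until the stack reaches a status like CREATE_COMPLETE or ROLLBACK_COMPLETE. The timeout also covers statuses such as REVIEW_IN_PROGRESS that do not resolve on their own.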
static delete(stack_name: str, region: str | None = None, retain_resources: List[str] | None = None, logger: Logger | None = None) -> None

Deletes a HyperPod cluster CloudFormation stack.

Removes the specified CloudFormation stack and all associated AWS resources. This operation cannot be undone and proceeds automatically without confirmation.

Parameters:

| Parameter | Type | Description |
|---|---|---|
| stack_name | str | Name of the CloudFormation stack to delete |
| region | str, optional | AWS region where the stack exists |
| retain_resources | List[str], optional | List of logical resource IDs to retain during deletion (only works on DELETE_FAILED stacks) |
| logger | logging.Logger, optional | Logger instance for output messages. Uses the default logger if not provided |

Raises:

ValueError: When the stack doesn’t exist or the retain_resources limitation is encountered

RuntimeError: When CloudFormation deletion fails

Exception: For other deletion errors

Usage Examples
>>> # Delete a stack (automatically proceeds without confirmation)
>>> HpClusterStack.delete("my-stack-name")
>>>
>>> # Delete in specific region
>>> HpClusterStack.delete("my-stack-name", region="us-west-2")
>>>
>>> # Delete with retained resources (only works on DELETE_FAILED stacks)
>>> HpClusterStack.delete("my-stack-name", retain_resources=["S3Bucket", "EFSFileSystem"])
>>>
>>> # Delete with custom logger
>>> import logging
>>> logger = logging.getLogger(__name__)
>>> HpClusterStack.delete("my-stack-name", logger=logger)
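
Since retain_resources only applies to stacks in DELETE_FAILED, it can be useful to check the status before issuing the call. The following is a minimal sketch of such a guard; delete_retaining is a hypothetical helper, and the status and delete functions are passed in as arguments rather than imported.

```python
from typing import Callable, List, Optional


def delete_retaining(
    check_status: Callable[..., Optional[str]],
    delete: Callable[..., None],
    stack_name: str,
    retain_resources: List[str],
    region: Optional[str] = None,
) -> None:
    """Delete a stack with retained resources only if it is in DELETE_FAILED.

    Raises ValueError without deleting anything when the stack is in any
    other status (or was not found), so nothing is removed by accident.
    """
    status = check_status(stack_name, region=region)
    if status != "DELETE_FAILED":
        raise ValueError(
            f"Stack {stack_name!r} is in status {status!r}; retain_resources "
            "only applies to stacks in DELETE_FAILED."
        )
    delete(stack_name, region=region, retain_resources=retain_resources)
```

It might be called as delete_retaining(HpClusterStack.check_status, HpClusterStack.delete, "my-stack-name", ["S3Bucket"]). Refusing to proceed, rather than falling back to a full delete, is a deliberate choice here because deletion cannot be undone.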

SageMaker Core Cluster Update Method

Cluster properties can also be updated with the update method of the SageMaker Core Cluster resource, which is imported from sagemaker_core.main.resources:

Cluster.update(instance_groups=None, restricted_instance_groups=None, node_recovery=None, instance_groups_to_delete=None)

Update a SageMaker Core Cluster resource.

Parameters:

| Parameter | Type | Description |
|---|---|---|
| instance_groups | List[ClusterInstanceGroupSpecification] | List of instance group specifications to update |
| restricted_instance_groups | List[ClusterRestrictedInstanceGroupSpecification] | List of restricted instance group specifications |
| node_recovery | str | Node recovery setting (“Automatic” or “None”) |
| instance_groups_to_delete | List[str] | List of instance group names to delete |

Returns:

The updated Cluster resource

Raises:

  • botocore.exceptions.ClientError: AWS service related errors

  • ConflictException: Conflict when modifying SageMaker entity

  • ResourceLimitExceeded: SageMaker resource limit exceeded

  • ResourceNotFound: Resource being accessed is not found

Usage Examples
from sagemaker_core.main.resources import Cluster
from sagemaker_core.main.shapes import ClusterInstanceGroupSpecification

# Get existing cluster
cluster = Cluster.get(cluster_name="my-cluster")

# Update cluster with new instance groups and node recovery
cluster.update(
    instance_groups=[
        ClusterInstanceGroupSpecification(
            InstanceCount=2,
            InstanceGroupName="worker-nodes",
            InstanceType="ml.m5.large"
        )
    ],
    node_recovery="Automatic",
    instance_groups_to_delete=["old-group-name"]
)
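
Note that node_recovery takes the string "None", which is distinct from Python's None (meaning the argument is omitted). The sketch below shows one way to validate that value and drop unset arguments before calling update; build_update_kwargs is a hypothetical helper, not part of sagemaker_core.

```python
from typing import Any, Dict, List, Optional

# The two node recovery settings listed in the parameter table above.
VALID_NODE_RECOVERY = ("Automatic", "None")


def build_update_kwargs(
    instance_groups: Optional[List[Any]] = None,
    node_recovery: Optional[str] = None,
    instance_groups_to_delete: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Return keyword arguments for Cluster.update, omitting unset values.

    Raises ValueError for an unknown node_recovery string or when no
    argument was supplied at all.
    """
    if node_recovery is not None and node_recovery not in VALID_NODE_RECOVERY:
        raise ValueError(
            f"node_recovery must be one of {VALID_NODE_RECOVERY}, got {node_recovery!r}"
        )
    candidates = {
        "instance_groups": instance_groups,
        "node_recovery": node_recovery,
        "instance_groups_to_delete": instance_groups_to_delete,
    }
    kwargs = {key: value for key, value in candidates.items() if value is not None}
    if not kwargs:
        raise ValueError("Nothing to update: no arguments were provided")
    return kwargs
```

The result can be unpacked into the call, for example cluster.update(**build_update_kwargs(node_recovery="Automatic")).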