Cluster Management
- class sagemaker.hyperpod.cluster_management.hp_cluster_stack.HpClusterStack
Manages SageMaker HyperPod cluster CloudFormation stacks.
This class provides functionality to create, manage, and monitor CloudFormation stacks for SageMaker HyperPod clusters. It extends ClusterStackBase with stack lifecycle operations.
Usage Examples
>>> # Create a cluster stack instance
>>> stack = HpClusterStack()
>>> response = stack.create(region="us-west-2")
>>>
>>> # Check stack status
>>> status = stack.get_status()
>>> print(status)
- create(region: str | None = None, template_version: int | None = 1) -> str
Creates a new HyperPod cluster CloudFormation stack.
Parameters:
region (str, optional): AWS region for stack creation. Uses the current session region if not specified.
Returns:
dict: CloudFormation describe_stacks response containing stack details
Raises:
Exception: When CloudFormation stack creation fails
Usage Examples
>>> # Create stack in default region
>>> stack = HpClusterStack()
>>> response = stack.create()
>>>
>>> # Create stack in specific region
>>> response = stack.create(region="us-east-1")
- static describe(stack_name, region: str | None = None)
Describes a CloudFormation stack by name.
Note
Stack descriptions are region-specific. You must use the correct region where the stack was created to retrieve its description.
Parameters:
stack_name (str): Name of the CloudFormation stack to describe. For ARN format arn:aws:cloudformation:region:account:stack/stack-name/stack-id, use the stack-name part.
region (str, optional): AWS region where the stack exists.
Returns:
dict: CloudFormation describe_stacks response
Raises:
ValueError: When stack is not accessible or doesn’t exist
RuntimeError: When CloudFormation operation fails
Usage Examples
>>> # Describe a stack by name
>>> response = HpClusterStack.describe("my-stack-name")
>>>
>>> # Describe stack in specific region
>>> response = HpClusterStack.describe("my-stack", region="us-west-2")
- static list(region: str | None = None, stack_status_filter: List[str] | None = None)
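The stack_name description above says to pass only the stack-name part when you hold a full stack ARN. A minimal sketch of that extraction follows. The helper name parse_stack_identifier is hypothetical and not part of HpClusterStack; it relies only on the ARN layout quoted in the parameter description.

```python
from typing import Optional, Tuple


def parse_stack_identifier(identifier: str) -> Tuple[str, Optional[str]]:
    """Split a stack name or stack ARN into (stack_name, region).

    Hypothetical helper, not part of the HyperPod SDK. Plain stack names
    are returned unchanged with region None.
    """
    if not identifier.startswith("arn:"):
        return identifier, None

    # ARN layout: arn:partition:cloudformation:region:account:stack/stack-name/stack-id
    fields = identifier.split(":", 5)
    if len(fields) != 6 or fields[2] != "cloudformation":
        raise ValueError(f"Not a CloudFormation ARN: {identifier!r}")

    resource_parts = fields[5].split("/")
    if len(resource_parts) < 2 or resource_parts[0] != "stack" or not resource_parts[1]:
        raise ValueError(f"Not a CloudFormation stack ARN: {identifier!r}")

    return resource_parts[1], fields[3]


# Usage with the documented API (commented out so the sketch runs without the SDK):
# name, region = parse_stack_identifier(stack_arn)
# response = HpClusterStack.describe(name, region=region)
```

Returning the region alongside the name is a convenience, since stack descriptions are region-specific and the ARN already carries the region in which the stack was created.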
Lists all CloudFormation stacks in the specified region.
Note
Stack listings are region-specific. If no region is provided, uses the default region from your AWS configuration.
Parameters:
region (str, optional): AWS region to list stacks from. Uses the default region if not specified.
Returns:
dict: CloudFormation list_stacks response containing stack summaries
Raises:
ValueError: When insufficient permissions to list stacks
RuntimeError: When CloudFormation list operation fails
Usage Examples
>>> # List stacks in current region
>>> stacks = HpClusterStack.list()
>>>
>>> # List stacks in specific region
>>> stacks = HpClusterStack.list(region="us-east-1")
- get_status(region: str | None = None)
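Because list() is documented as returning the CloudFormation list_stacks response, the result can be reduced to name and status pairs. The sketch below assumes the returned dict keeps the standard boto3 shape (a "StackSummaries" list whose entries carry "StackName" and "StackStatus"). The helper summarize_stacks is hypothetical.

```python
from typing import Any, Dict, List, Tuple


def summarize_stacks(
    response: Dict[str, Any], skip_deleted: bool = True
) -> List[Tuple[str, str]]:
    """Return sorted (stack_name, stack_status) pairs from a list_stacks response.

    Hypothetical helper. Assumes the standard CloudFormation response shape.
    """
    rows = []
    for summary in response.get("StackSummaries", []):
        status = summary.get("StackStatus", "UNKNOWN")
        # Unfiltered list_stacks results can still include deleted stacks.
        if skip_deleted and status == "DELETE_COMPLETE":
            continue
        rows.append((summary.get("StackName", ""), status))
    return sorted(rows)


# Usage with the documented API (commented out so the sketch runs without the SDK):
# for name, status in summarize_stacks(HpClusterStack.list(region="us-east-1")):
#     print(f"{name}: {status}")
```

The stack_status_filter argument in the list() signature may let you narrow the results on the service side instead, for example stack_status_filter=["CREATE_COMPLETE"], assuming it is forwarded to CloudFormation unchanged.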
Gets the status of the current stack instance.
Parameters:
region (str, optional): AWS region where the stack exists.
Returns:
str: CloudFormation stack status (e.g., ‘CREATE_COMPLETE’, ‘UPDATE_IN_PROGRESS’)
Raises:
ValueError: When stack hasn’t been created yet (call create() first)
Usage Examples
>>> # Create stack first, then check status
>>> stack = HpClusterStack()
>>> stack.create()
>>> status = stack.get_status()
>>> print(f"Stack status: {status}")
- static check_status(stack_name: str, region: str | None = None)
Checks the status of any CloudFormation stack by name.
Parameters:
stack_name (str): Name of the CloudFormation stack.
region (str, optional): AWS region where the stack exists.
Returns:
str: CloudFormation stack status or None if stack not found
Usage Examples
>>> # Check status of any stack
>>> status = HpClusterStack.check_status("my-stack-name")
>>>
>>> # Check status in specific region
>>> status = HpClusterStack.check_status("my-stack", region="us-west-2")
- static delete(stack_name: str, region: str | None = None, retain_resources: List[str] | None = None, logger: Logger | None = None) -> None
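Stack operations run asynchronously, so a common pattern is to poll check_status until the stack leaves a transitional state. The sketch below is one possible approach, not part of the SDK: wait_until_settled is a hypothetical helper that takes the status function as an argument and treats any status ending in _IN_PROGRESS as transitional, which matches CloudFormation's status naming.

```python
import time
from typing import Callable, Optional


def wait_until_settled(
    check_status: Callable[..., Optional[str]],
    stack_name: str,
    region: Optional[str] = None,
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Poll until the stack status no longer ends in _IN_PROGRESS.

    Hypothetical helper. Returns the settled status, or None if the stack
    is not found. Raises TimeoutError when the deadline passes first.
    Note: a stack parked in REVIEW_IN_PROGRESS will only exit via timeout.
    """
    deadline = clock() + timeout_seconds
    while True:
        status = check_status(stack_name, region=region)
        if status is None or not status.endswith("_IN_PROGRESS"):
            return status
        if clock() >= deadline:
            raise TimeoutError(
                f"Stack {stack_name!r} still in {status} after {timeout_seconds}s"
            )
        sleep(poll_seconds)


# Usage with the documented API (commented out so the sketch runs without the SDK):
# final = wait_until_settled(HpClusterStack.check_status, "my-stack", region="us-west-2")
```

Passing the status function, sleep, and clock as parameters keeps the loop testable without AWS access. The settled status still needs inspecting, since values such as ROLLBACK_COMPLETE or CREATE_FAILED also end the wait.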
Deletes a HyperPod cluster CloudFormation stack.
Removes the specified CloudFormation stack and all associated AWS resources. This operation cannot be undone and proceeds automatically without confirmation.
Parameters:
stack_name (str): Name of the CloudFormation stack to delete.
region (str, optional): AWS region where the stack exists.
retain_resources (List[str], optional): List of logical resource IDs to retain during deletion (only works on DELETE_FAILED stacks).
logger (logging.Logger, optional): Logger instance for output messages. Uses the default logger if not provided.
Raises:
ValueError: When stack doesn’t exist or retain_resources limitation is encountered
RuntimeError: When CloudFormation deletion fails
Exception: For other deletion errors
Usage Examples
>>> # Delete a stack (automatically proceeds without confirmation)
>>> HpClusterStack.delete("my-stack-name")
>>>
>>> # Delete in specific region
>>> HpClusterStack.delete("my-stack-name", region="us-west-2")
>>>
>>> # Delete with retained resources (only works on DELETE_FAILED stacks)
>>> HpClusterStack.delete("my-stack-name", retain_resources=["S3Bucket", "EFSFileSystem"])
>>>
>>> # Delete with custom logger
>>> import logging
>>> logger = logging.getLogger(__name__)
>>> HpClusterStack.delete("my-stack-name", logger=logger)
SageMaker Core Cluster Update Method
Cluster properties can also be updated with the update method of the SageMaker Core Cluster resource, imported from sagemaker_core.main.resources:
- Cluster.update(instance_groups=None, restricted_instance_groups=None, node_recovery=None, instance_groups_to_delete=None)
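Since retain_resources only works on stacks already in DELETE_FAILED, one way to avoid the resulting ValueError is to check the status first and pass retain_resources only when it applies. The sketch below illustrates that decision under stated assumptions: request_delete is a hypothetical helper, and the delete and status functions are passed in so the logic can be exercised without AWS access.

```python
from typing import Callable, List, Optional


def request_delete(
    delete: Callable[..., None],
    check_status: Callable[..., Optional[str]],
    stack_name: str,
    region: Optional[str] = None,
    retain_on_failure: Optional[List[str]] = None,
) -> str:
    """Request stack deletion, using retain_resources only on DELETE_FAILED stacks.

    Hypothetical helper. Returns a label for the action taken:
    "not_found", "delete_requested", or "delete_requested_with_retain".
    """
    status = check_status(stack_name, region=region)
    if status is None:
        # Nothing to delete; skip the call that would raise ValueError.
        return "not_found"

    if status == "DELETE_FAILED" and retain_on_failure:
        # A previous delete failed: retry while keeping the named resources.
        delete(stack_name, region=region, retain_resources=list(retain_on_failure))
        return "delete_requested_with_retain"

    delete(stack_name, region=region)
    return "delete_requested"


# Usage with the documented API (commented out so the sketch runs without the SDK):
# request_delete(
#     HpClusterStack.delete,
#     HpClusterStack.check_status,
#     "my-stack-name",
#     retain_on_failure=["S3Bucket"],
# )
```

The return value only records which request was issued. Whether the deletion has finished would still need to be confirmed through check_status.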
Update a SageMaker Core Cluster resource.
Parameters:
instance_groups (List[ClusterInstanceGroupSpecification]): List of instance group specifications to update.
restricted_instance_groups (List[ClusterRestrictedInstanceGroupSpecification]): List of restricted instance group specifications.
node_recovery (str): Node recovery setting (“Automatic” or “None”).
instance_groups_to_delete (List[str]): List of instance group names to delete.
Returns:
The updated Cluster resource
Raises:
botocore.exceptions.ClientError: AWS service related errors
ConflictException: Conflict when modifying SageMaker entity
ResourceLimitExceeded: SageMaker resource limit exceeded
ResourceNotFound: Resource being accessed is not found
Usage Examples
from sagemaker_core.main.resources import Cluster
from sagemaker_core.main.shapes import ClusterInstanceGroupSpecification

# Get existing cluster
cluster = Cluster.get(cluster_name="my-cluster")

# Update cluster with new instance groups and node recovery
cluster.update(
    instance_groups=[
        ClusterInstanceGroupSpecification(
            InstanceCount=2,
            InstanceGroupName="worker-nodes",
            InstanceType="ml.m5.large"
        )
    ],
    node_recovery="Automatic",
    instance_groups_to_delete=["old-group-name"]
)
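One detail that is easy to miss in the parameter list is that node_recovery takes the string "None", which is different from Python's None (the default, meaning the argument is not supplied). The sketch below shows one way to guard against mixing them up before calling update. The helper build_update_kwargs is hypothetical, and the allowed values are taken from the parameter description above.

```python
from typing import Any, Dict, List, Optional

# Allowed values per the parameter description; "None" here is a string.
VALID_NODE_RECOVERY = ("Automatic", "None")


def build_update_kwargs(
    instance_groups: Optional[List[Any]] = None,
    restricted_instance_groups: Optional[List[Any]] = None,
    node_recovery: Optional[str] = None,
    instance_groups_to_delete: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Validate node_recovery and drop arguments that were not supplied.

    Hypothetical helper, not part of sagemaker_core.
    """
    if node_recovery is not None and node_recovery not in VALID_NODE_RECOVERY:
        raise ValueError(
            f"node_recovery must be one of {VALID_NODE_RECOVERY}, got {node_recovery!r}"
        )
    candidates = {
        "instance_groups": instance_groups,
        "restricted_instance_groups": restricted_instance_groups,
        "node_recovery": node_recovery,
        "instance_groups_to_delete": instance_groups_to_delete,
    }
    # Python None means "not supplied", so those keys are omitted.
    return {key: value for key, value in candidates.items() if value is not None}


# Usage with the documented API (commented out so the sketch runs without the SDK):
# cluster.update(**build_update_kwargs(node_recovery="None"))  # disables recovery
```

Validating locally surfaces a typo such as "automatic" before any request is sent, rather than as a service-side error.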