Overview

Transform your AI/ML development process with the Amazon SageMaker HyperPod CLI and SDK. These tools handle the complexities of infrastructure management, allowing you to focus on model development and innovation. Whether you are scaling PyTorch training jobs across thousands of GPUs, deploying production-grade inference endpoints, or managing multiple clusters efficiently, the intuitive command-line interface and programmatic control enable you to:

Accelerate development cycles and reduce operational overhead
Automate ML workflows while maintaining operational visibility
Optimize computing resources across your AI/ML projects

Note

Version Info: You are viewing the latest documentation for SageMaker HyperPod CLI and SDK v3.0.0.

What’s New

🚀 We are excited to announce the general availability of the Amazon SageMaker HyperPod CLI and SDK!

Major Updates:

Distributed Training: Scale PyTorch jobs across multiple nodes and GPUs with simplified management and automatic fault tolerance.
Model Inference: Deploy pre-trained models from SageMaker JumpStart and host custom auto-scaling inference endpoints.
Observability: Connect to and manage multiple HyperPod clusters with enhanced monitoring capabilities.
Usability Improvements: Intuitive CLI for quick experimentation and cluster management, granular SDK control over workload configurations, and easy access to system logs and observability dashboards for efficient debugging.