Overview
Transform your AI/ML development process with the Amazon SageMaker HyperPod CLI and SDK. These tools handle the complexity of infrastructure management so you can focus on model development and innovation. Whether you are scaling PyTorch training jobs across thousands of GPUs, deploying production-grade inference endpoints, or managing multiple clusters, the command-line interface and the SDK's programmatic control enable you to:
Accelerate development cycles and reduce operational overhead
Automate ML workflows while maintaining operational visibility
Optimize computing resources across your AI/ML projects
Note
Version info: You are viewing the latest documentation for the SageMaker HyperPod CLI and SDK v3.0.0.
What’s New
🚀 We are excited to announce the general availability of the Amazon SageMaker HyperPod CLI and SDK!
Major Updates:
Distributed Training: Scale PyTorch jobs across multiple nodes and GPUs with simplified management and automatic fault tolerance.
Model Inference: Deploy pre-trained models from SageMaker JumpStart and host custom auto-scaling inference endpoints.
Observability: Connect to and manage multiple HyperPod clusters with enhanced monitoring capabilities.
Usability Improvements: An intuitive CLI for quick experimentation and cluster management, granular SDK control over workload configurations, and easy access to system logs and observability dashboards for efficient debugging.
Quick Start
New to HyperPod? Install the CLI and SDK in minutes.
Ready to explore? Connect to your cluster before running ML workflows.
Scale Your ML Models! Get started with training
Deploy Your ML Model! Get started with inference
Advanced Resources
Explore APIs - Check out the API documentation
Example Notebooks - Ready-to-use implementation guides
HyperPod Documentation - Learn more about HyperPod
Developer Guide - Refer to this practical development guide
Practical Guide - Refer to the workshop for detailed step-by-step instructions