Inference with SageMaker HyperPod#
SageMaker HyperPod provides powerful capabilities for deploying and managing inference endpoints on EKS-hosted clusters. This guide covers how to create, invoke, and manage inference endpoints using both the HyperPod CLI and SDK.
Overview#
SageMaker HyperPod inference endpoints allow you to:
Deploy pre-trained JumpStart models
Deploy custom models with your own inference code
Configure resource requirements for inference
Manage endpoint lifecycle
Invoke endpoints for real-time predictions
Monitor endpoint performance
Creating Inference Endpoints – CLI Init Experience#
The following is a step-by-step guide of creating a jumpstart endpoint with hyp-jumpstart-endpoint template for init experience. In order to create a custom endpoint, you can use hyp-custom-endpoint template during the init command call. The init experience is the same across templates.
1. Start with a Clean Directory#
It’s recommended to start with a new and clean directory for each endpoint configuration:
mkdir my-endpoint
cd my-endpoint
2. Initialize a New Endpoint Configuration#
hyp init hyp-jumpstart-endpoint
Note
In order to create custom endpoint, you can simply use hyp init hyp-custom-endpoint.
This creates three files:
config.yaml: The main configuration file you’ll use to customize your endpointk8s.jinja: A reference template for parameters mapping in kubernetes payloadREADME.md: Usage guide with instructions and examples
3. Configure Your Endpoint#
You can configure your endpoint in two ways:
Option 1: Edit config.yaml directly
The config.yaml file contains key parameters like:
template: hyp-jumpstart-endpoint
version: 1.0
model_id:
instance_type:
endpoint_name:
Option 2: Use CLI command (Pre-Deployment)
hyp configure --endpoint-name your-endpoint-name
Note
The hyp configure command only modifies local configuration files. It
does not affect existing deployed endpoints.
4. Create the Endpoint#
hyp create
This will:
Validate your configuration
Create a timestamped folder in the
rundirectoryInitialize the endpoint creation process
Creating Inference Endpoints – CLI/SDK#
You can create inference endpoints using either JumpStart models or custom models:
JumpStart Model Endpoints#
hyp create hyp-jumpstart-endpoint \
--model-id jumpstart-model-id \
--instance-type ml.g5.8xlarge \
--endpoint-name endpoint-jumpstart
from sagemaker.hyperpod.inference.config.hp_jumpstart_endpoint_config import Model, Server, SageMakerEndpoint, TlsConfig
from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint
model = Model(
model_id="deepseek-llm-r1-distill-qwen-1-5b"
)
server = Server(
instance_type="ml.g5.8xlarge"
)
endpoint_name = SageMakerEndpoint(name="endpoint-jumpstart")
js_endpoint = HPJumpStartEndpoint(
model=model,
server=server,
sage_maker_endpoint=endpoint_name
)
js_endpoint.create()
Custom Model Endpoints#
hyp create hyp-custom-endpoint \
--version 1.0 \
--endpoint-name endpoint-s3 \
--model-name <model-name> \
--model-source-type s3 \
--instance-type <instance-type> \
--image-uri <image-uri> \
--container-port 8080 \
--model-volume-mount-name model-weights
from sagemaker.hyperpod.inference.config.hp_endpoint_config import CloudWatchTrigger, Dimensions, AutoScalingSpec, Metrics, S3Storage, ModelSourceConfig, TlsConfig, EnvironmentVariables, ModelInvocationPort, ModelVolumeMount, Resources, Worker
from sagemaker.hyperpod.inference.hp_endpoint import HPEndpoint
model_source_config = ModelSourceConfig(
model_source_type='s3',
model_location="<my-model-folder-in-s3>",
s3_storage=S3Storage(
bucket_name='<my-model-artifacts-bucket>',
region='us-east-2',
),
)
environment_variables = [
EnvironmentVariables(name="HF_MODEL_ID", value="/opt/ml/model"),
EnvironmentVariables(name="SAGEMAKER_PROGRAM", value="inference.py"),
EnvironmentVariables(name="SAGEMAKER_SUBMIT_DIRECTORY", value="/opt/ml/model/code"),
EnvironmentVariables(name="MODEL_CACHE_ROOT", value="/opt/ml/model"),
EnvironmentVariables(name="SAGEMAKER_ENV", value="1"),
]
worker = Worker(
image='763104351884.dkr.ecr.us-east-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi2.3.1-gpu-py311-cu124-ubuntu22.04-v2.0',
model_volume_mount=ModelVolumeMount(
name='model-weights',
),
model_invocation_port=ModelInvocationPort(container_port=8080),
resources=Resources(
requests={"cpu": "30000m", "nvidia.com/gpu": 1, "memory": "100Gi"},
limits={"nvidia.com/gpu": 1}
),
environment_variables=environment_variables,
)
tls_config=TlsConfig(tls_certificate_output_s3_uri='s3://<my-tls-bucket-name>')
custom_endpoint = HPEndpoint(
endpoint_name='<my-endpoint-name>',
instance_type='ml.g5.8xlarge',
model_name='deepseek15b-test-model-name',
tls_config=tls_config,
model_source_config=model_source_config,
worker=worker,
)
custom_endpoint.create()
Key Parameters#
When creating an inference endpoint, you’ll need to specify:
Parameters required for Jumpstart Endpoint
Parameter |
Type |
Required |
Description |
|---|---|---|---|
endpoint-name |
TEXT |
Yes |
Unique identifier for your endpoint |
instance-type |
TEXT |
Yes |
The EC2 instance type to use |
model-id |
TEXT |
Yes |
ID of the pre-trained JumpStart model |
Parameters required for Custom Endpoint
Parameter |
Type |
Required |
Description |
|---|---|---|---|
endpoint-name |
TEXT |
Yes |
Unique identifier for your endpoint |
instance-type |
TEXT |
Yes |
The EC2 instance type to use |
image-uri |
TEXT |
Yes |
Docker image containing your inference code |
model-name |
TEXT |
Yes |
Name of model to create on SageMaker |
model-source-type |
TEXT |
Yes |
Source type: fsx or s3 |
model-volume-mount-name |
TEXT |
Yes |
Name of the model volume mount |
container-port |
INTEGER |
Yes |
Port on which the model server listens |
Managing Inference Endpoints#
List Endpoints#
# List JumpStart endpoints
hyp list hyp-jumpstart-endpoint
# List custom endpoints
hyp list hyp-custom-endpoint
from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint
from sagemaker.hyperpod.inference.hp_endpoint import HPEndpoint
# List JumpStart endpoints
jumpstart_endpoints = HPJumpStartEndpoint.list()
print(jumpstart_endpoints)
# List custom endpoints
custom_endpoints = HPEndpoint.list()
print(custom_endpoints)
Describe an Endpoint#
# Describe JumpStart endpoint
hyp describe hyp-jumpstart-endpoint --name <endpoint-name>
# Describe custom endpoint
hyp describe hyp-custom-endpoint --name <endpoint-name>
from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint
from sagemaker.hyperpod.inference.hp_endpoint import HPEndpoint
# Get JumpStart endpoint details
jumpstart_endpoint = HPJumpStartEndpoint.get(name="js-endpoint-name", namespace="test")
print(jumpstart_endpoint)
# Get custom endpoint details
custom_endpoint = HPEndpoint.get(name="endpoint-custom")
print(custom_endpoint)
Invoke an Endpoint#
# Invoke Jumpstart endpoint
hyp invoke hyp-jumpstart-endpoint \
--endpoint-name <endpoint-name> \
--body '{"inputs":"What is the capital of USA?"}'
# Invoke custom endpoint
hyp invoke hyp-custom-endpoint \
--endpoint-name <endpoint-name> \
--body '{"inputs": "What is machine learning?"}'
from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint
from sagemaker.hyperpod.inference.hp_endpoint import HPEndpoint
data = '{"inputs":"What is the capital of USA?"}'
jumpstart_endpoint = HPJumpStartEndpoint.get(name="endpoint-jumpstart")
response = jumpstart_endpoint.invoke(body=data).body.read()
print(response)
custom_endpoint = HPEndpoint.get(name="endpoint-custom")
response = custom_endpoint.invoke(body=data).body.read()
print(response)
List Pods#
# JumpStart endpoint
hyp list-pods hyp-jumpstart-endpoint
# Custom endpoint
hyp list-pods hyp-custom-endpoint
from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint
from sagemaker.hyperpod.inference.hp_endpoint import HPEndpoint
# List pods
js_pods = HPJumpStartEndpoint.list_pods()
print(js_pods)
c_pods = HPEndpoint.list_pods()
print(c_pods)
Get Logs#
# JumpStart endpoint
hyp get-logs hyp-jumpstart-endpoint --pod-name <pod-name>
# Custom endpoint
hyp get-logs hyp-custom-endpoint --pod-name <pod-name>
from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint
from sagemaker.hyperpod.inference.hp_endpoint import HPEndpoint
# Get logs from pod
js_logs = HPJumpStartEndpoint.get_logs(pod=<pod-name>)
print(js_logs)
c_logs = HPEndpoint.get_logs(pod=<pod-name>)
print(c_logs)
Get Operator Logs#
# JumpStart endpoint
hyp get-operator-logs hyp-jumpstart-endpoint --since-hours 0.5
# Custom endpoint
hyp get-operator-logs hyp-custom-endpoint --since-hours 0.5
from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint
from sagemaker.hyperpod.inference.hp_endpoint import HPEndpoint
# Invoke JumpStart endpoint
print(HPJumpStartEndpoint.get_operator_logs(since_hours=0.1))
# Invoke custom endpoint
print(HPEndpoint.get_operator_logs(since_hours=0.1))
Delete an Endpoint#
# Delete JumpStart endpoint
hyp delete hyp-jumpstart-endpoint --name <endpoint-name>
# Delete custom endpoint
hyp delete hyp-custom-endpoint --name <endpoint-name>
from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint
from sagemaker.hyperpod.inference.hp_endpoint import HPEndpoint
# Delete JumpStart endpoint
jumpstart_endpoint = HPJumpStartEndpoint.get(name="endpoint-jumpstart")
jumpstart_endpoint.delete()
# Delete custom endpoint
custom_endpoint = HPEndpoint.get(name="endpoint-custom")
custom_endpoint.delete()
Inference Example Notebooks#
For detailed examples of inference with HyperPod, explore these interactive Jupyter notebooks:
CLI Examples:
SDK Examples:
These Jupyter notebooks demonstrate comprehensive workflows for deploying and managing inference endpoints using different model storage options and both CLI and SDK approaches. You can run these notebooks directly in your local environment or SageMaker Studio.