Training#
Complete reference for SageMaker HyperPod PyTorch training job parameters and configuration options.
Create Training Job – Init Experience#
hyp init#
Initialize a template scaffold in the current directory.
Syntax#
hyp init TEMPLATE [DIRECTORY] [OPTIONS]
Parameters#
| Parameter | Type | Required | Description |
|---|---|---|---|
| TEMPLATE | CHOICE | Yes | Template type (cluster-stack, hyp-pytorch-job, hyp-custom-endpoint, hyp-jumpstart-endpoint) |
| DIRECTORY | PATH | No | Target directory (default: current directory) |
| | TEXT | No | Schema version to use |
hyp configure#
Configure training job parameters interactively or via command line.
Important
Pre-Deployment Configuration: This command modifies local config.yaml files before job creation.
Syntax#
hyp configure [OPTIONS]
Parameters#
This command dynamically supports all configuration parameters available in the current template’s schema. Common parameters include:
| Parameter | Type | Required | Description |
|---|---|---|---|
| | TEXT | Yes | Unique name for the training job (1-63 characters, alphanumeric with hyphens) |
| | TEXT | Yes | Docker image URI containing your training code |
| | TEXT | No | Kubernetes namespace |
| | ARRAY | No | Command to run in the container (array of strings) |
| | ARRAY | No | Arguments for the entry script (array of strings) |
| | OBJECT | No | Environment variables as key-value pairs |
| | TEXT | No | Image pull policy (Always, Never, IfNotPresent) |
| | TEXT | No | Instance type for training |
| | INTEGER | No | Number of nodes (minimum: 1) |
| | INTEGER | No | Number of tasks per node (minimum: 1) |
| | OBJECT | No | Node label selector as key-value pairs |
| | BOOLEAN | No | Schedule pods only on nodes that passed deep health check (default: false) |
| | TEXT | No | Scheduler type |
| | TEXT | No | Queue name for job scheduling (1-63 characters, alphanumeric with hyphens) |
| | TEXT | No | Priority class for job scheduling |
| | INTEGER | No | Maximum number of job retries (minimum: 0) |
| | ARRAY | No | List of volume configurations (see Volume Configuration for parameter details) |
| | TEXT | No | Service account name |
| | INTEGER | No | Number of accelerators (GPUs or Trainium chips) |
| | FLOAT | No | Number of vCPUs |
| | FLOAT | No | Amount of memory in GiB |
| | INTEGER | No | Limit for the number of accelerators (GPUs or Trainium chips) |
| | FLOAT | No | Limit for the number of vCPUs |
| | FLOAT | No | Limit for the amount of memory in GiB |
| | TEXT | No | Preferred topology annotation for scheduling |
| | TEXT | No | Required topology annotation for scheduling |
| | FLAG | No | Enable debug mode (default: false) |
Note: The exact parameters available depend on your current template type and version. Run hyp configure --help to see all available options for your specific configuration.
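The job name and queue name rows above both state the same constraint: 1-63 characters, alphanumeric with hyphens. A minimal sketch of a local pre-check for that rule follows. The function name `is_valid_job_name` is hypothetical and is not part of the `hyp` CLI, and the CLI or Kubernetes may enforce stricter rules than the one stated here.

```shell
# Hypothetical helper (not part of hyp): checks a name against the rule
# stated in the table, i.e. 1-63 characters, letters, digits, and hyphens.
is_valid_job_name() {
  name="$1"
  # Length must be between 1 and 63 characters.
  [ "${#name}" -ge 1 ] && [ "${#name}" -le 63 ] || return 1
  # Reject any character other than a letter, digit, or hyphen.
  case "$name" in
    *[!A-Za-z0-9-]*) return 1 ;;
  esac
  return 0
}

# Example: prints "ok" for a name that satisfies the rule.
if is_valid_job_name "demo-job-01"; then
  echo "ok"
fi
```

Running a check like this before `hyp configure` can catch an obviously malformed name early, but it does not replace the schema validation done by the CLI.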
hyp validate#
Validate the current directory’s configuration file syntax and structure.
Syntax#
# Validate current configuration syntax
hyp validate
# Example output on success
✔️ config.yaml is valid!
# Example output with syntax errors
❌ Config validation errors:
– job_name: Field is required
Parameters#
No parameters required.
Note
This command validates only the syntax and structure of the config.yaml file against the appropriate schema. It checks:
YAML syntax: Ensures file is valid YAML
Required fields: Verifies all mandatory fields are present
Data types: Confirms field values match expected types (string, number, boolean, array)
Schema structure: Validates against the template’s defined structure
It does not verify that the values themselves are usable (e.g., whether AWS regions exist, instance types are available, or resources can be created).
Prerequisites
Must be run in a directory where hyp init has created configuration files
A config.yaml file must exist in the current directory
Output
Success: Displays confirmation message if syntax is valid
Errors: Lists specific syntax errors with field names and descriptions
hyp reset#
Reset the current directory’s config.yaml to default values.
Syntax#
hyp reset
Parameters#
No parameters required.
hyp create#
Create a new HyperPod training job using the provided configuration.
Syntax#
hyp create [OPTIONS]
Parameters#
| Parameter | Type | Required | Description |
|---|---|---|---|
| | FLAG | No | Enable debug logging |
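Taken together, the commands in this section form an init, configure, validate, create sequence. The sketch below chains them in a shell function. It rests on several assumptions that the reference above does not confirm: the `--job-name` and `--image` flag spellings for `hyp configure` (check `hyp configure --help`), that `hyp init` creates the target directory, and that each command exits non-zero on failure. The function name `run_init_flow` and all argument values are placeholders.

```shell
# Sketch of the init experience as one function.
# Assumptions: --job-name and --image flag spellings, and non-zero exit
# codes on failure. The body runs in a subshell so the caller's working
# directory is left unchanged.
run_init_flow() (
  # $1 = target directory, $2 = job name, $3 = image URI
  hyp init hyp-pytorch-job "$1" || exit 1   # scaffold the template
  cd "$1" || exit 1                         # later commands read ./config.yaml
  hyp configure --job-name "$2" --image "$3" || exit 1
  hyp validate || exit 1                    # stop before create if invalid
  hyp create
)

# Example call (placeholder values):
# run_init_flow my-training-job demo-job <image-uri>
```

Stopping at the first failing step keeps an invalid config.yaml from reaching `hyp create`.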
Create Training Job – Direct Create#
hyp create hyp-pytorch-job#
Create distributed PyTorch training jobs on SageMaker HyperPod clusters.
Syntax#
hyp create hyp-pytorch-job [OPTIONS]
Parameters#
| Parameter | Type | Required | Description |
|---|---|---|---|
| | TEXT | Yes | Unique name for the training job (1-63 characters, alphanumeric with hyphens) |
| | TEXT | Yes | Docker image URI containing your training code |
| | TEXT | No | Kubernetes namespace |
| | ARRAY | No | Command to run in the container (array of strings) |
| | ARRAY | No | Arguments for the entry script (array of strings) |
| | OBJECT | No | Environment variables as key-value pairs |
| | TEXT | No | Image pull policy (Always, Never, IfNotPresent) |
| | TEXT | No | Instance type for training |
| | INTEGER | No | Number of nodes (minimum: 1) |
| | INTEGER | No | Number of tasks per node (minimum: 1) |
| | OBJECT | No | Node label selector as key-value pairs |
| | BOOLEAN | No | Schedule pods only on nodes that passed deep health check (default: false) |
| | TEXT | No | Scheduler type |
| | TEXT | No | Queue name for job scheduling (1-63 characters, alphanumeric with hyphens) |
| | TEXT | No | Priority class for job scheduling |
| | INTEGER | No | Maximum number of job retries (minimum: 0) |
| --volume | ARRAY | No | List of volume configurations (see Volume Configuration for parameter details) |
| | TEXT | No | Service account name |
| | INTEGER | No | Number of accelerators (GPUs or Trainium chips) |
| | FLOAT | No | Number of vCPUs |
| | FLOAT | No | Amount of memory in GiB |
| | INTEGER | No | Limit for the number of accelerators (GPUs or Trainium chips) |
| | FLOAT | No | Limit for the number of vCPUs |
| | FLOAT | No | Limit for the amount of memory in GiB |
| | TEXT | No | Preferred topology annotation for scheduling |
| | TEXT | No | Required topology annotation for scheduling |
| | INTEGER | No | Maximum number of nodes |
| | INTEGER | No | Scaling step size for elastic training. Provide either this or elastic-replica-discrete-values |
| | INTEGER | No | Graceful shutdown timeout in seconds for elastic scaling operations |
| | INTEGER | No | Scaling timeout for elastic training |
| | INTEGER | No | Timeout period after a job restart during which no scale-up or workload admission is allowed |
| | ARRAY | No | Alternative to elastic-replica-increment-step. Provides exact values for the total replica count (array of integers) |
| | FLAG | No | Enable debug mode (default: false) |
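As an illustration of a direct create call, the sketch below wraps `hyp create hyp-pytorch-job` in a function. The flag spellings `--job-name`, `--image`, `--instance-type`, and `--node-count` are assumptions, since the parameter names are not shown in the table above; confirm them with `hyp create hyp-pytorch-job --help`. The function name `create_pytorch_job` and all values are placeholders. Only the `--volume` value follows syntax documented in the Volume Configuration section.

```shell
# Sketch only. Assumed flag spellings: --job-name, --image,
# --instance-type, --node-count. All values are placeholders.
create_pytorch_job() {
  # $1 = job name, $2 = image URI, $3 = instance type, $4 = node count
  hyp create hyp-pytorch-job \
    --job-name "$1" \
    --image "$2" \
    --instance-type "$3" \
    --node-count "$4" \
    --volume name=model-data,type=hostPath,mount_path=/data,path=/host/data
}

# Example call (placeholder values):
# create_pytorch_job demo-job <image-uri> ml.p4d.24xlarge 2
```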
Volume Configuration#
The --volume parameter supports mounting different types of storage to your training containers.
Volume Syntax#
--volume name=<volume_name>,type=<volume_type>,mount_path=<mount_path>[,additional_options]
Volume Types#
hostPath Volume
--volume name=model-data,type=hostPath,mount_path=/data,path=/host/data
Persistent Volume Claim (PVC)
--volume name=training-output,type=pvc,mount_path=/output,claim_name=training-pvc,read_only=false
Volume Parameters#
| Parameter | Type | Required | Description |
|---|---|---|---|
| name | TEXT | Yes | Volume name |
| type | TEXT | Yes | Volume type (hostPath or pvc) |
| mount_path | TEXT | Yes | Mount path in container |
| path | TEXT | For hostPath | Host path for hostPath volumes |
| claim_name | TEXT | For pvc | PVC claim name for pvc volumes |
| read_only | BOOLEAN | No | Read-only flag for pvc volumes |
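Because the `--volume` value is a comma-separated list of key=value pairs, it can be convenient to assemble it from individual fields. The helper below is a minimal sketch; the function name `build_volume_arg` is hypothetical and not part of the `hyp` CLI. It only formats the string according to the syntax shown above and does not check that the paths or claims exist.

```shell
# Hypothetical helper (not part of hyp): formats a --volume value.
# $1 = name, $2 = type (hostPath or pvc), $3 = mount_path,
# $4 = path (hostPath) or claim_name (pvc), $5 = read_only (pvc, optional)
build_volume_arg() {
  case "$2" in
    hostPath)
      printf 'name=%s,type=hostPath,mount_path=%s,path=%s\n' "$1" "$3" "$4"
      ;;
    pvc)
      if [ -n "${5:-}" ]; then
        printf 'name=%s,type=pvc,mount_path=%s,claim_name=%s,read_only=%s\n' \
          "$1" "$3" "$4" "$5"
      else
        printf 'name=%s,type=pvc,mount_path=%s,claim_name=%s\n' "$1" "$3" "$4"
      fi
      ;;
    *)
      echo "unsupported volume type: $2" >&2
      return 1
      ;;
  esac
}

# Reproduces the two examples from the Volume Types section.
build_volume_arg model-data hostPath /data /host/data
build_volume_arg training-output pvc /output training-pvc false
```

The result can then be passed as `--volume "$(build_volume_arg ...)"`.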
Training Job Management Commands#
Commands for managing PyTorch training jobs.
hyp list hyp-pytorch-job#
List all HyperPod PyTorch jobs in a namespace.
Syntax#
hyp list hyp-pytorch-job [OPTIONS]
Parameters#
| Parameter | Type | Required | Description |
|---|---|---|---|
| | TEXT | No | Namespace to list jobs from (default: “default”) |
hyp describe hyp-pytorch-job#
Describe a specific HyperPod PyTorch job.
Syntax#
hyp describe hyp-pytorch-job [OPTIONS]
Parameters#
| Parameter | Type | Required | Description |
|---|---|---|---|
| | TEXT | Yes | Name of the job to describe |
| | TEXT | No | Namespace of the job (default: “default”) |
hyp delete hyp-pytorch-job#
Delete a HyperPod PyTorch job.
Syntax#
hyp delete hyp-pytorch-job [OPTIONS]
Parameters#
| Parameter | Type | Required | Description |
|---|---|---|---|
| | TEXT | Yes | Name of the job to delete |
| | TEXT | No | Namespace of the job (default: “default”) |
hyp list-pods hyp-pytorch-job#
List all pods associated with a PyTorch job.
Syntax#
hyp list-pods hyp-pytorch-job [OPTIONS]
Parameters#
| Parameter | Type | Required | Description |
|---|---|---|---|
| | TEXT | Yes | Name of the job to list pods for |
| | TEXT | No | Namespace of the job (default: “default”) |
hyp get-logs hyp-pytorch-job#
Get logs from a specific pod in a PyTorch job.
Syntax#
hyp get-logs hyp-pytorch-job [OPTIONS]
Parameters#
| Parameter | Type | Required | Description |
|---|---|---|---|
| | TEXT | Yes | Name of the job |
| | TEXT | Yes | Name of the pod to get logs from |
| | TEXT | No | Namespace of the job (default: “default”) |
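The management commands are typically used together when inspecting a job: describe it, list its pods, then fetch logs from one pod. The sketch below shows that sequence. The flag spellings `--job-name`, `--pod-name`, and `--namespace` are assumptions, since the parameter names are not shown in the tables above; confirm them with each command's `--help`. The function names and argument values are placeholders.

```shell
# Sketch only. Assumed flag spellings: --job-name, --pod-name, --namespace.
inspect_job() {
  # $1 = job name, $2 = namespace (falls back to "default")
  ns="${2:-default}"
  hyp describe hyp-pytorch-job --job-name "$1" --namespace "$ns" || return 1
  hyp list-pods hyp-pytorch-job --job-name "$1" --namespace "$ns"
}

fetch_pod_logs() {
  # $1 = job name, $2 = pod name, $3 = namespace (falls back to "default")
  hyp get-logs hyp-pytorch-job \
    --job-name "$1" --pod-name "$2" --namespace "${3:-default}"
}

# Example calls (placeholder values):
# inspect_job demo-job
# fetch_pod_logs demo-job <pod-name>
```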