Install: `claude install-skill dgallitelli/aws-hyperpod-skill`
# Amazon SageMaker HyperPod Expert
You are an expert in Amazon SageMaker HyperPod for provisioning resilient ML training clusters with AWS Trainium and NVIDIA GPUs.
## When This Skill Activates
- Creating HyperPod clusters (EKS or Slurm)
- Running distributed ML training jobs
- Troubleshooting cluster issues
- Checking quotas or instance availability
- User mentions: "hyperpod", "hyp", "trainium", "trn1", "distributed training"
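
For the quota check listed above, one possible approach is to query Service Quotas with boto3. This is a minimal sketch, not part of the skill itself. The helper names `cluster_quotas` and `fetch_sagemaker_quotas` are hypothetical, and the assumption that HyperPod quota names contain "for cluster usage" should be verified in the Service Quotas console for your account.

```python
from typing import Iterable


def cluster_quotas(quotas: Iterable[dict], instance_type: str) -> dict:
    """Return {quota name: value} for cluster quotas that mention instance_type.

    Assumption: HyperPod quotas are named like
    "<instance type> for cluster usage"; verify this in your account.
    """
    return {
        q["QuotaName"]: q["Value"]
        for q in quotas
        if instance_type in q["QuotaName"]
        and "cluster usage" in q["QuotaName"].lower()
    }


def fetch_sagemaker_quotas(region: str) -> list:
    """Fetch all SageMaker quotas from Service Quotas (needs AWS credentials)."""
    import boto3  # imported lazily so cluster_quotas works without boto3

    client = boto3.client("service-quotas", region_name=region)
    paginator = client.get_paginator("list_service_quotas")
    quotas = []
    for page in paginator.paginate(ServiceCode="sagemaker"):
        quotas.extend(page["Quotas"])
    return quotas
```

With credentials configured, `cluster_quotas(fetch_sagemaker_quotas("us-west-2"), "ml.trn1.32xlarge")` would list the matching quotas; the region here is only an example. A value of 0 means a quota increase is needed before the cluster can be created.
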
## Detailed Guides
| Guide | Use When |
|-------|----------|
| [reference/eks-guide.md](reference/eks-guide.md) | EKS orchestration, `hyp` CLI, add-ons, Pod Identity |
| [reference/slurm-guide.md](reference/slurm-guide.md) | Slurm orchestration, lifecycle scripts, SBATCH |
| [reference/troubleshooting.md](reference/troubleshooting.md) | Error diagnosis and solutions |
---
## Orchestrator Selection
| Aspect | EKS | Slurm |
|--------|-----|-------|
| AZ Requirement | **2+ AZs required** | Single AZ OK |
| Primary Tool | `hyp` CLI | AWS CLI |
| Job Submission | PyTorchJob via `hyp create` | SBATCH scripts |
| Access Method | kubectl | SSM Session Manager |
| Best For | Kubernetes teams, container workloads | HPC teams, batch jobs |
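
The table above can be read as a small decision rule. The sketch below encodes its rows as data and picks an orchestrator from two inputs. The names `Orchestrator` and `suggest_orchestrator` are hypothetical, and reducing the "Best For" row to a single `prefers_kubernetes` flag is a simplifying assumption, not a rule from the HyperPod documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Orchestrator:
    name: str
    min_azs: int         # "AZ Requirement" row
    primary_tool: str    # "Primary Tool" row
    job_submission: str  # "Job Submission" row
    access_method: str   # "Access Method" row


EKS = Orchestrator("EKS", 2, "hyp CLI", "PyTorchJob via hyp create", "kubectl")
SLURM = Orchestrator("Slurm", 1, "AWS CLI", "SBATCH scripts", "SSM Session Manager")


def suggest_orchestrator(available_azs: int, prefers_kubernetes: bool) -> Orchestrator:
    """Suggest an orchestrator using only the criteria in the table."""
    if available_azs < 1:
        raise ValueError("at least one Availability Zone is required")
    if prefers_kubernetes:
        if available_azs < EKS.min_azs:
            # Surface the AZ constraint instead of silently falling back to Slurm.
            raise ValueError(f"EKS needs {EKS.min_azs}+ AZs, got {available_azs}")
        return EKS
    return SLURM
```

Raising on the single-AZ Kubernetes case is a deliberate choice in this sketch: it makes the 2+ AZ requirement visible early, before any cluster resources are provisioned.
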
---
## Instance Types
| Instance Type | Accelerator | Count | Use Case |
|---------------|-------------|-------|----------|
| ml.p4d.24xlarge | A100 (40GB) | 8 | General training |
| ml.p4de.24xlarge | A100 (80GB) | 8 | Large models |
| ml.p5.48xlarge | H100 | 8 | Latest gen training |
| ml.trn1.32xlarge | Trainium | 16 | C