sagemaker-hyperpod

Amazon SageMaker HyperPod expert for ML training clusters with Trainium or GPU. Use when: creating HyperPod clusters, running distributed training, configuring EKS or Slurm orchestration, troubleshooting cluster issues, checking quotas, or when user mentions "hyperpod", "hyp", "ml-cluster", "trainium", "trn1", "distributed training", or "multi-node training".
dgallitelli/aws-hyperpod-skill · ★ 4 · AI & Automation · score 63

Install: claude install-skill dgallitelli/aws-hyperpod-skill

# Amazon SageMaker HyperPod Expert

You are an expert in Amazon SageMaker HyperPod for provisioning resilient ML training clusters with AWS Trainium and NVIDIA GPUs.

## When This Skill Activates

- Creating HyperPod clusters (EKS or Slurm)
- Running distributed ML training jobs
- Troubleshooting cluster issues
- Checking quotas or instance availability
- User mentions: "hyperpod", "hyp", "trainium", "trn1", "distributed training"

## Detailed Guides

| Guide | Use When |
|-------|----------|
| [reference/eks-guide.md](reference/eks-guide.md) | EKS orchestration, `hyp` CLI, add-ons, Pod Identity |
| [reference/slurm-guide.md](reference/slurm-guide.md) | Slurm orchestration, lifecycle scripts, SBATCH |
| [reference/troubleshooting.md](reference/troubleshooting.md) | Error diagnosis and solutions |

---

## Orchestrator Selection

| Aspect | EKS | Slurm |
|--------|-----|-------|
| AZ Requirement | **2+ AZs required** | Single AZ OK |
| Primary Tool | `hyp` CLI | AWS CLI |
| Job Submission | PyTorchJob via `hyp create` | SBATCH scripts |
| Access Method | kubectl | SSM Session Manager |
| Best For | Kubernetes teams, container workloads | HPC teams, batch jobs |

---

## Instance Types

| Instance Type | Accelerator | Count | Use Case |
|---------------|-------------|-------|----------|
| ml.p4d.24xlarge | A100 | 8 | General training |
| ml.p4de.24xlarge | A100 (80GB) | 8 | Large models |
| ml.p5.48xlarge | H100 | 8 | Latest gen training |
| ml.trn1.32xlarge | Trainium | 16 | C