hyperpod-version-checker

Check and compare software component versions on SageMaker HyperPod cluster nodes: NVIDIA drivers, CUDA toolkit, cuDNN, NCCL, EFA, AWS OFI NCCL, GDRCopy, MPI, Neuron SDK (Trainium/Inferentia), Python, and PyTorch. Use when checking component versions, verifying CUDA/driver compatibility, detecting version mismatches across nodes, planning upgrades, documenting cluster configuration, or troubleshooting version-related issues on HyperPod. Triggers on requests about versions, compatibility, component checks, or upgrade planning for HyperPod clusters.

AI & Automation · 784 stars · 115 forks · Updated today · Apache-2.0

Quality Score: 95/100

Stars (20%): 96
Recency (20%): 100
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 50
License (10%): 100
Description (5%): 100

Skill Content

# HyperPod Version Checker

Upload to cluster nodes via `hyperpod-ssm` skill, then execute.

## Usage

```bash
# Text report to console + file
bash hyperpod_check_versions.sh

# JSON only to stdout (text report still saved to file); best for piping/parsing
bash hyperpod_check_versions.sh --json

# Custom output file
bash hyperpod_check_versions.sh --output /tmp/versions.txt

# No color (for logging)
bash hyperpod_check_versions.sh --no-color
```

Output file: `component_versions_<hostname>_<timestamp>.txt` (default)

## What It Checks

| Component     | Detection Method                                | Applicable When                               |
| ------------- | ----------------------------------------------- | --------------------------------------------- |
| NVIDIA Driver | `nvidia-smi`                                    | GPU instances (p3/p4/p5/g5)                   |
| CUDA Toolkit  | `nvcc`, `/usr/local/cuda` symlink               | GPU instances                                 |
| cuDNN         | Header file, packages                           | GPU instances doing deep learning             |
| NCCL          | Library filename, header, packages              | Distributed GPU training                      |
| EFA           | `/opt/amazon/efa_installed_packages`, `fi_info` | EFA-capable instances (p4d/p4de/p5/trn1/trn2) |
| AWS OFI NCCL  | `efa_installed_packages`, library search        | EFA + NCCL workloads ...
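
The detection methods listed for the NVIDIA driver and CUDA toolkit can be approximated by hand when the script is not yet on a node. The following is a minimal sketch, not the actual logic of `hyperpod_check_versions.sh`: the helper names (`probe`, `driver_version`, `cuda_version`, `cuda_symlink`) and the `not found` fallback string are hypothetical, chosen only for illustration. It assumes `nvidia-smi` and `nvcc` behave as they do in standard NVIDIA installs.

```bash
#!/usr/bin/env bash
# Illustrative sketch only; hyperpod_check_versions.sh may detect versions differently.

# Run a version command only if its tool is available; print the first line
# of output, or "not found" (a hypothetical placeholder) when nothing is reported.
# Usage: probe <tool> <command> [args...]
probe() {
  local tool="$1"; shift
  local out=""
  if command -v "$tool" >/dev/null 2>&1; then
    out="$("$@" 2>/dev/null | head -n 1)"
  fi
  echo "${out:-not found}"
}

# NVIDIA driver version as reported by nvidia-smi.
driver_version() {
  probe nvidia-smi nvidia-smi --query-gpu=driver_version --format=csv,noheader
}

# CUDA toolkit release parsed from the "release X.Y" text in `nvcc --version`.
cuda_version() {
  probe nvcc sh -c 'nvcc --version | sed -n "s/.*release \([0-9.]*\).*/\1/p"'
}

# Resolved target of the /usr/local/cuda symlink, if the symlink exists.
cuda_symlink() {
  if [ -L /usr/local/cuda ]; then
    readlink -f /usr/local/cuda
  else
    echo "not found"
  fi
}

printf 'NVIDIA Driver:   %s\n' "$(driver_version)"
printf 'CUDA Toolkit:    %s\n' "$(cuda_version)"
printf '/usr/local/cuda: %s\n' "$(cuda_symlink)"
```

On a node without GPUs or without the CUDA toolkit, each line simply reports `not found`, which mirrors the "Applicable When" column: a missing component is expected on instance types where it does not apply.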

Details

Author: awslabs
Repository: awslabs/agent-plugins
Created: 4 months ago
Last Updated: today
Language: Shell
License: Apache-2.0

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Listed

sagemaker-hyperpod

Amazon SageMaker HyperPod expert for ML training clusters with Trainium or GPU. Use when: creating HyperPod clusters, running distributed training, configuring EKS or Slurm orchestration, troubleshooting cluster issues, checking quotas, or when user mentions "hyperpod", "hyp", "ml-cluster", "trainium", "trn1", "distributed training", or "multi-node training".

5 stars · Updated 4 months ago · dgallitelli

Data & Documents Solid

hyperpod-issue-report

Generate comprehensive issue reports from HyperPod clusters (EKS and Slurm) by collecting diagnostic logs and configurations for troubleshooting and AWS Support cases. Use when users need to collect diagnostics from HyperPod cluster nodes, generate issue reports for AWS Support, investigate node failures or performance problems, document cluster state, or create diagnostic snapshots. Triggers on requests involving issue reports, diagnostic collection, support case preparation, or cluster troubleshooting that requires gathering logs and system information from multiple nodes.

784 stars · Updated today · awslabs

AI & Automation Solid

hyperpod-ssm

Remote command execution and file transfer on SageMaker HyperPod cluster nodes via AWS Systems Manager (SSM). This is the primary interface for accessing HyperPod nodes — direct SSH is not available. Use when any skill, workflow, or user request needs to execute commands on cluster nodes, upload files to nodes, read/download files from nodes, run diagnostics, install packages, or perform any operation requiring shell access to HyperPod instances. Other HyperPod skills depend on this skill for all node-level operations.

784 stars · Updated today · awslabs