hyperpod-ssm

Solid

Remote command execution and file transfer on SageMaker HyperPod cluster nodes via AWS Systems Manager (SSM). This is the primary interface for accessing HyperPod nodes — direct SSH is not available. Use when any skill, workflow, or user request needs to execute commands on cluster nodes, upload files to nodes, read/download files from nodes, run diagnostics, install packages, or perform any operation requiring shell access to HyperPod instances. Other HyperPod skills depend on this skill for all node-level operations.

AI & Automation · 784 stars · 115 forks · Updated today · Apache-2.0

Quality Score: 95/100

Score breakdown (weight, with the score where the page shows one): Stars 20%; Recency 20%, 100; Frontmatter 20%; Documentation 15%, 100; Issue Health 10%; License 10%, 100; Description 5%, 100.

Skill Content

# HyperPod SSM Access

## SSM Target Format

Target: `sagemaker-cluster:<CLUSTER_ID>_<GROUP_NAME>-<INSTANCE_ID>`

- `CLUSTER_ID`: Last segment of the cluster ARN (NOT the cluster name). Extract via `get-cluster-info.sh`.
- `GROUP_NAME`: Instance group name. Retrieve via `list-nodes.sh`.
- `INSTANCE_ID`: EC2 instance ID (e.g., `i-0123456789abcdef0`)

## Scripts

Three scripts under `scripts/`. Resolve cluster info and nodes **once**, then execute per node.

### get-cluster-info.sh: Resolve cluster name → ID (call once)

```bash
scripts/get-cluster-info.sh CLUSTER_NAME [--region REGION]
# Output: {"cluster_id":"...","cluster_arn":"...","cluster_name":"...","region":"..."}
```

### list-nodes.sh: List all nodes with pagination (call once)

```bash
scripts/list-nodes.sh CLUSTER_NAME [--region REGION] [--instance-group GROUP] [--instance-id ID]
# Output: JSON array of ClusterNodeSummaries (InstanceId, InstanceGroupName, InstanceStatus, etc.)
```

`list-cluster-nodes` paginates at 100 nodes. This script handles pagination automatically.

### ssm-exec.sh: Execute command on a node (call per node)

```bash
# Execute (with pre-built target)
scripts/ssm-exec.sh --target "sagemaker-cluster:CLUSTERID_GROUP-INSTANCEID" 'command' [--region REGION]

# Execute (with parts)
scripts/ssm-exec.sh --cluster-id ID --group GROUP --instance-id INSTANCE_ID 'command' [--region REGION]

# Upload
scripts/ssm-exec.sh --target TARGET --upload LOCAL_PATH REMOTE_PATH [--region REGION]

# Read remote file
sc...
```
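
The target format and the "resolve once, execute per node" pattern described above can be sketched in a few lines of shell. This is a minimal illustration, not part of the skill's scripts: `build_target` is a hypothetical helper, and the ARN, group name, and instance IDs are made-up placeholders. It assumes the cluster ID is whatever follows the final `/` in the cluster ARN. The loop only prints the `ssm-exec.sh` invocations rather than running them, so it is safe to try without AWS credentials.

```bash
# Hypothetical helper: assemble an SSM target from a cluster ARN (or bare
# cluster ID), an instance group name, and an EC2 instance ID.
build_target() {
  local cluster_arn="$1" group="$2" instance_id="$3"
  # Assumption: the cluster ID is the text after the last "/" in the ARN.
  # A bare cluster ID (no "/") passes through unchanged.
  local cluster_id="${cluster_arn##*/}"
  printf 'sagemaker-cluster:%s_%s-%s\n' "$cluster_id" "$group" "$instance_id"
}

# Placeholder values for illustration only. In practice these would come
# from a single call each to get-cluster-info.sh and list-nodes.sh.
cluster_arn='arn:aws:sagemaker:us-west-2:111122223333:cluster/abc123def456'
group='worker-group'

# Resolve once (above), then build one target per node.
for instance_id in i-0123456789abcdef0 i-0fedcba9876543210; do
  target="$(build_target "$cluster_arn" "$group" "$instance_id")"
  # Dry run: print the command instead of executing it.
  echo "scripts/ssm-exec.sh --target \"$target\" 'hostname'"
done
```

Keeping the target construction in one function means the `_` and `-` separators are written in exactly one place, which matters because instance group names may themselves contain hyphens.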

Details

Author: awslabs
Repository: awslabs/agent-plugins
Created: 4 months ago
Last Updated: today
Language: Shell
License: Apache-2.0

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Listed

sagemaker-hyperpod

Amazon SageMaker HyperPod expert for ML training clusters with Trainium or GPU. Use when: creating HyperPod clusters, running distributed training, configuring EKS or Slurm orchestration, troubleshooting cluster issues, checking quotas, or when user mentions "hyperpod", "hyp", "ml-cluster", "trainium", "trn1", "distributed training", or "multi-node training".

5 Updated 4 months ago

dgallitelli

Data & Documents Solid

hyperpod-issue-report

Generate comprehensive issue reports from HyperPod clusters (EKS and Slurm) by collecting diagnostic logs and configurations for troubleshooting and AWS Support cases. Use when users need to collect diagnostics from HyperPod cluster nodes, generate issue reports for AWS Support, investigate node failures or performance problems, document cluster state, or create diagnostic snapshots. Triggers on requests involving issue reports, diagnostic collection, support case preparation, or cluster troubleshooting that requires gathering logs and system information from multiple nodes.

784 Updated today

awslabs

AI & Automation Solid

skillshare

Manages and syncs AI CLI skills and agents across 50+ tools from a single source. Use this skill whenever the user mentions "skillshare", runs skillshare commands, manages skills or agents (install, update, uninstall, sync, commit, audit, analyze, check, diff, search), or troubleshoots skill/agent configuration (orphaned symlinks, broken targets, sync issues). Covers both global (~/.config/skillshare/) and project (.skillshare/) modes. Also use when: adding new AI tool targets (Claude, Cursor, Windsurf, etc.), setting target include/exclude filters or copy vs symlink mode, using backup/restore or trash recovery, piping skillshare output to scripts (--json), setting up CI/CD audit pipelines, building/sharing skill hubs (hub index, hub add), or working with agents (single .md files synced to agent-capable targets like Claude, Cursor, Augment, OpenCode) via positional `agents` filter or `--kind agent`, plus `.agentignore` and `enable`/`disable` for per-agent toggles.

2,204 Updated yesterday

runkids