skypilot-multi-cloud-orchestration

Solid

Multi-cloud orchestration for ML workloads with automatic cost optimization. Use when you need to run training or batch jobs across multiple clouds, leverage spot instances with auto-recovery, or optimize GPU costs across providers.

AI & Automation 9,182 stars 697 forks Updated 1 month ago MIT

Quality Score: 94/100

Stars (20%): 100
Recency (20%): 75
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 50
License (10%): 100
Description (5%): 100

Skill Content

# SkyPilot Multi-Cloud Orchestration

Comprehensive guide to running ML workloads across clouds with automatic cost optimization using SkyPilot.

## When to use SkyPilot

**Use SkyPilot when:**

- Running ML workloads across multiple clouds (AWS, GCP, Azure, etc.)
- Need cost optimization with automatic cloud/region selection
- Running long jobs on spot instances with auto-recovery
- Managing distributed multi-node training
- Want unified interface for 20+ cloud providers
- Need to avoid vendor lock-in

**Key features:**

- **Multi-cloud**: AWS, GCP, Azure, Kubernetes, Lambda, RunPod, 20+ providers
- **Cost optimization**: Automatic cheapest cloud/region selection
- **Spot instances**: 3-6x cost savings with automatic recovery
- **Distributed training**: Multi-node jobs with gang scheduling
- **Managed jobs**: Auto-recovery, checkpointing, fault tolerance
- **Sky Serve**: Model serving with autoscaling

**Use alternatives instead:**

- **Modal**: For simpler serverless GPU with Python-native API
- **RunPod**: For single-cloud persistent pods
- **Kubernetes**: For existing K8s infrastructure
- **Ray**: For pure Ray-based orchestration

## Quick start

### Installation

```bash
pip install "skypilot[aws,gcp,azure,kubernetes]"

# Verify cloud credentials
sky check
```

### Hello World

Create `hello.yaml`:

```yaml
resources:
  accelerators: T4:1

run: |
  nvidia-smi
  echo "Hello from SkyPilot!"
```

Launch:

```bash
sky launch -c hello hello.yaml

# SSH to cluster
ssh hello

# Term...
```
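
As a hedged illustration of the spot-instance feature listed under Key features, the Hello World task could be adapted roughly as below. This is a minimal sketch and is not taken from the original guide: the file name `spot-hello.yaml` and the job name `spot-hello` are hypothetical. `use_spot` is SkyPilot's resource field for requesting spot capacity, and the `sky jobs launch` command shown in the comment submits the task as a managed job, which is the mode that handles recovery after a preemption.

```yaml
# spot-hello.yaml (hypothetical file name, for illustration only)
#
# Submit as a managed job so a preempted instance is relaunched:
#   sky jobs launch -n spot-hello spot-hello.yaml

resources:
  accelerators: T4:1   # same GPU request as hello.yaml
  use_spot: true       # ask for spot (preemptible) capacity

run: |
  nvidia-smi
  echo "Hello from a SkyPilot spot instance!"
```

A real training job would also need to write checkpoints to persistent storage and resume from them on restart. That part is left out here because it depends on the workload.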

Details

Author: Orchestra-Research
Repository: Orchestra-Research/AI-Research-SKILLs
Created: 7 months ago
Last Updated: 1 month ago
Language: TeX
License: MIT
