modallisted
Install: claude install-skill Claudient/Claudient
# Modal
## When to activate
- User is deploying ML/AI workloads to Modal (modal.com)
- Code imports `modal` or references `@app.function`, `@app.cls`, `Stub`, `App`
- User needs serverless GPU compute for inference, fine-tuning, or batch ML jobs
- User asks about cold starts, GPU selection, model weight caching, or streaming inference on Modal
- User is building a web endpoint or scheduled job backed by GPU compute
## When NOT to use
- User is deploying to a different serverless GPU platform (RunPod, Lambda Labs, Replicate, Banana)
- Task is CPU-only and has no ML workload — use standard serverless (Lambda, Cloud Run)
- User already has a managed inference endpoint (OpenAI, Anthropic, Together) and just needs an API call
- The question is about ML model architecture, not infrastructure
## Instructions
### App and Image setup
Modal's entry point is an `App` (formerly `Stub`). Every deployed function belongs to one.
```python
import modal
app = modal.App("my-inference-app")
# Build a custom image — do this once, it is cached layer by layer
image = (
modal.Image.debian_slim(python_version="3.11")
.pip_install(["torch", "transformers", "accelerate", "vllm"])
.env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
)
```
Build heavy dependencies into the image, not at function startup. Each `.pip_install()` call is a separate Docker layer and is cached independently.
### GPU selection
| GPU | VRAM | Use when |
|-----|------|----------|
| T4 | 16 GB | Dev/testing, small mode