cuda-graphs

Solid

Expert skill for CUDA Graph capture and optimization for reduced launch overhead. Capture CUDA operations into graphs, instantiate and execute graph instances, update graph node parameters, profile graph vs stream execution, design graph-friendly kernel patterns, and optimize launch latency for inference.

AI & Automation · 1,160 stars · 71 forks · Updated today · MIT

Quality Score: 96/100

Stars (weight 20%): 100
Recency (weight 20%): 100
Frontmatter (weight 20%): 70
Documentation (weight 15%): 100
Issue Health (weight 10%): 50
License (weight 10%): 100
Description (weight 5%): 100

Skill Content

# cuda-graphs

You are **cuda-graphs** - a specialized skill for CUDA Graph capture and optimization. This skill provides expert capabilities for reducing kernel launch overhead and optimizing execution patterns through graph-based workflows.

## Overview

This skill enables AI-powered CUDA Graph operations including:

- Capturing CUDA operations into graphs
- Instantiating and executing graph instances
- Updating graph node parameters
- Profiling graph vs stream execution
- Designing graph-friendly kernel patterns
- Handling conditional graph execution
- Integrating graphs with NCCL operations
- Optimizing launch latency for inference

## Prerequisites

- NVIDIA CUDA Toolkit 10.0+ (basic graphs)
- CUDA 11.0+ for graph updates
- CUDA 12.0+ for conditional nodes
- GPU with compute capability 7.0+
- Nsight Systems for graph profiling

## Capabilities

### 1. Stream Capture Basic

Capture stream operations into a graph:

```cuda
#include <cuda_runtime.h>

cudaGraph_t graph;
cudaGraphExec_t graphExec;
cudaStream_t stream;

cudaStreamCreate(&stream);

// Begin stream capture
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

// Record operations to be captured
kernel1<<<grid1, block1, 0, stream>>>(args1);
kernel2<<<grid2, block2, 0, stream>>>(args2);
kernel3<<<grid3, block3, 0, stream>>>(args3);

// End capture and create graph
cudaStreamEndCapture(stream, &graph);

// Instantiate the graph for execution
cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);

// Execute...
```
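The excerpt above is cut off before the launch step, so the following is a minimal, self-contained sketch of how a captured graph might be replayed. It is an illustration under stated assumptions, not the skill's own code. The kernel `addOne`, the helper `runGraph`, and the `CUDA_CHECK` macro are hypothetical names introduced here. The version guard around `cudaGraphInstantiate` assumes the three-argument form from CUDA 12 and the five-argument form from earlier toolkits.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                             \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,  \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

// Hypothetical kernel for illustration: adds 1.0f to every element.
__global__ void addOne(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Captures three addOne launches into a graph, replays the graph
// `replays` times, and returns the first element of the buffer.
float runGraph(int n, int replays) {
    float* d = nullptr;
    // Allocate and zero the buffer before capture begins; allocation
    // calls are not permitted while a global-mode capture is active.
    CUDA_CHECK(cudaMalloc(&d, n * sizeof(float)));
    CUDA_CHECK(cudaMemset(d, 0, n * sizeof(float)));
    CUDA_CHECK(cudaDeviceSynchronize());

    cudaStream_t stream;
    CUDA_CHECK(cudaStreamCreate(&stream));

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);

    // Launches issued during capture are recorded, not executed.
    cudaGraph_t graph;
    CUDA_CHECK(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal));
    for (int k = 0; k < 3; ++k) {
        addOne<<<grid, block, 0, stream>>>(d, n);
    }
    CUDA_CHECK(cudaStreamEndCapture(stream, &graph));

    // Instantiate once; the executable graph can be launched many times.
    cudaGraphExec_t graphExec;
#if CUDART_VERSION >= 12000
    CUDA_CHECK(cudaGraphInstantiate(&graphExec, graph, 0));
#else
    CUDA_CHECK(cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0));
#endif

    // Each launch submits all three kernels with a single API call.
    for (int r = 0; r < replays; ++r) {
        CUDA_CHECK(cudaGraphLaunch(graphExec, stream));
    }
    CUDA_CHECK(cudaStreamSynchronize(stream));

    float first = 0.0f;
    CUDA_CHECK(cudaMemcpy(&first, d, sizeof(float), cudaMemcpyDeviceToHost));

    // Release the executable graph, the graph template, and the rest.
    CUDA_CHECK(cudaGraphExecDestroy(graphExec));
    CUDA_CHECK(cudaGraphDestroy(graph));
    CUDA_CHECK(cudaStreamDestroy(stream));
    CUDA_CHECK(cudaFree(d));
    return first;
}

int main() {
    // 3 kernels per graph x 10 replays = 30 increments per element.
    float v = runGraph(1 << 20, 10);
    std::printf("data[0] after 10 replays: %.1f\n", v);
    return 0;
}
```

The point of the design is that capture and instantiation are paid once, while each replay costs a single `cudaGraphLaunch`. This is where the reduced launch overhead described above would be expected to come from, particularly for short kernels.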

Details

Author: a5c-ai
Repository: a5c-ai/babysitter
Created: 4 months ago
Last Updated: today
Language: JavaScript
License: MIT

Similar Skills

Semantically similar based on skill content, not just the same category

AI & Automation Solid

gpu-benchmarking

Expert skill for automated GPU performance benchmarking and regression detection. Design micro-benchmarks, measure kernel execution time with CUDA events, calculate achieved vs theoretical performance, generate comparison reports, detect regressions in CI/CD, and profile power/thermal characteristics.

1,160 Updated today
a5c-ai
AI & Automation Solid

cuda-toolkit

Deep integration with NVIDIA CUDA toolkit for kernel development, compilation, and debugging. Execute nvcc compilation with optimization flags analysis, generate and validate CUDA kernel code, analyze PTX/SASS assembly output, and configure execution parameters.

1,160 Updated today
a5c-ai
AI & Automation Solid

cuda-debugging

Expert skill for GPU debugging using CUDA-GDB and NVIDIA Compute Sanitizer. Detect memory errors, race conditions, uninitialized memory access, validate atomic operations, analyze kernel synchronization issues, and generate debugging reports with recommendations.

1,160 Updated today
a5c-ai
AI & Automation Solid

parallel-patterns

GPU parallel algorithm design patterns and implementations. Implement parallel reduction, scan/prefix sum, histogram, parallel sort algorithms, stream compaction, and work-efficient patterns optimized for specific GPU architectures.

1,160 Updated today
a5c-ai
AI & Automation Solid

nsight-profiler

Expert skill for NVIDIA Nsight Systems and Nsight Compute profiling tools. Configure profiling sessions, analyze kernel reports, interpret occupancy metrics, roofline model data, memory bandwidth bottlenecks, and warp execution efficiency.

1,160 Updated today
a5c-ai