warp-primitives

Solid

Warp-level programming and SIMD optimization. Use warp shuffle instructions, voting functions, cooperative groups, and warp-synchronous algorithms, and minimize warp divergence, to improve GPU performance.

AI & Automation 1,160 stars 71 forks Updated today MIT

Quality Score: 96/100

Stars (20%): 100
Recency (20%): 100
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 50
License (10%): 100
Description (5%): 100

Skill Content

# warp-primitives

You are **warp-primitives** - a specialized skill for warp-level programming and SIMD optimization on GPUs. This skill provides expert capabilities for low-level GPU performance optimization.

## Overview

This skill supports warp-level programming tasks, including:

- Use warp shuffle instructions (`__shfl_*_sync`)
- Implement warp voting functions (`__ballot_sync`, `__any_sync`, `__all_sync`)
- Design warp-synchronous algorithms
- Minimize warp divergence
- Use cooperative groups for flexible synchronization
- Implement warp-level reductions
- Analyze and minimize warp stalls
- Support CUDA 11+ warp intrinsics

## Prerequisites

- CUDA Toolkit 11.0+
- GPU with compute capability 3.5+ (CUDA 11 no longer targets compute capability 3.0)
- Understanding of the SIMT execution model

## Capabilities

### 1. Warp Shuffle Instructions

Data exchange within a warp:

```cuda
// __shfl_sync: Broadcast from any lane
__device__ float warpBroadcast(float val, int srcLane) {
    return __shfl_sync(0xffffffff, val, srcLane);
}

// __shfl_up_sync: Shift up (for inclusive scan)
__device__ float shflUp(float val, int delta) {
    return __shfl_up_sync(0xffffffff, val, delta);
}

// __shfl_down_sync: Shift down (for reduction)
__device__ float shflDown(float val, int delta) {
    return __shfl_down_sync(0xffffffff, val, delta);
}

// __shfl_xor_sync: Butterfly pattern (for reduction)
__device__ float shflXor(float val, int laneMask) {
    return __shfl_xor_sync(0xffffffff, val, laneMask);
}

// Warp-level reduction using shuffle
__device__ float warpReduce...
```
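The overview lists warp voting functions, but the excerpt shown here only reaches the shuffle wrappers. As an illustration, the following is a minimal sketch of the `_sync` vote intrinsics, not the skill's own code. The kernel name `warpVoteKernel`, the host helper `runWarpVoteDemo`, and the test data are all hypothetical. The sketch assumes a 1D launch whose block size is a multiple of 32, so every warp is full and the full mask `0xffffffff` is valid.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel (not from the skill): one vote per warp on "input > 0".
// Assumes a 1D launch with a block size that is a multiple of 32.
__global__ void warpVoteKernel(const float* in, int n,
                               int* counts, int* anyFlags, int* allFlags) {
    int idx  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % warpSize;
    int warp = idx / warpSize;  // global warp index for a 1D launch

    // Out-of-range threads still take part in the vote, with a false predicate.
    bool pred = (idx < n) && (in[idx] > 0.0f);

    const unsigned fullMask = 0xffffffffu;
    unsigned ballot = __ballot_sync(fullMask, pred);  // bit i set if lane i voted true
    int any = __any_sync(fullMask, pred);             // nonzero if any lane voted true
    int all = __all_sync(fullMask, pred);             // nonzero if every lane voted true

    if (lane == 0) {  // one writer per warp
        counts[warp]   = __popc(ballot);  // number of lanes that voted true
        anyFlags[warp] = (any != 0);
        allFlags[warp] = (all != 0);
    }
}

// Hypothetical host helper: runs two warps and copies the results back.
// Returns 0 on success, 1 on any CUDA error.
int runWarpVoteDemo(int counts[2], int anyFlags[2], int allFlags[2]) {
    const int n = 64, warps = 2;
    float hIn[n];
    for (int i = 0; i < n; ++i) {
        // Warp 0: even lanes positive, odd lanes negative. Warp 1: all positive.
        hIn[i] = (i < 32 && (i % 2 == 1)) ? -1.0f : 1.0f;
    }

    float* dIn = nullptr;
    int* dOut = nullptr;  // layout: counts | anyFlags | allFlags
    if (cudaMalloc(&dIn, n * sizeof(float)) != cudaSuccess) return 1;
    if (cudaMalloc(&dOut, 3 * warps * sizeof(int)) != cudaSuccess) {
        cudaFree(dIn);
        return 1;
    }

    int hOut[3 * warps];
    cudaError_t err = cudaMemcpy(dIn, hIn, n * sizeof(float), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        warpVoteKernel<<<1, n>>>(dIn, n, dOut, dOut + warps, dOut + 2 * warps);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) {
        // This blocking copy also waits for the kernel to finish.
        err = cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);
    }
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return 1;

    for (int w = 0; w < warps; ++w) {
        counts[w]   = hOut[w];
        anyFlags[w] = hOut[warps + w];
        allFlags[w] = hOut[2 * warps + w];
    }
    return 0;
}

int main() {
    int counts[2], anyFlags[2], allFlags[2];
    if (runWarpVoteDemo(counts, anyFlags, allFlags) != 0) {
        std::fprintf(stderr, "CUDA error\n");
        return 1;
    }
    for (int w = 0; w < 2; ++w) {
        std::printf("warp %d: count=%d any=%d all=%d\n",
                    w, counts[w], anyFlags[w], allFlags[w]);
    }
    return 0;
}
```

With this test data, warp 0 should report 16 positive lanes (`any=1`, `all=0`) and warp 1 should report 32 (`any=1`, `all=1`). Computing the predicate before the vote, instead of returning early for out-of-range threads, keeps every lane named in the mask participating in the intrinsic, which the `_sync` variants require.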

Details

Author: a5c-ai
Repository: a5c-ai/babysitter
Created: 4 months ago
Last Updated: today
Language: JavaScript
License: MIT

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Solid

parallel-patterns

GPU parallel algorithm design patterns and implementations. Implement parallel reduction, scan/prefix sum, histogram, parallel sort algorithms, stream compaction, and work-efficient patterns optimized for specific GPU architectures.

1,160 Updated today
a5c-ai
AI & Automation Solid

gpu-memory-analysis

Specialized skill for GPU memory hierarchy analysis and optimization. Analyze memory access patterns, detect bank conflicts, optimize cache utilization, profile global memory bandwidth, and generate optimized memory access code patterns.

1,160 Updated today
a5c-ai
AI & Automation Solid

cuda-toolkit

Deep integration with NVIDIA CUDA toolkit for kernel development, compilation, and debugging. Execute nvcc compilation with optimization flags analysis, generate and validate CUDA kernel code, analyze PTX/SASS assembly output, and configure execution parameters.

1,160 Updated today
a5c-ai
AI & Automation Solid

unified-memory

Expert skill for CUDA Unified Memory and memory prefetching optimization. Configure managed memory allocations, implement memory prefetch strategies, handle page fault analysis, configure memory hints and advise, profile unified memory migration, optimize for oversubscription scenarios, and compare managed vs explicit memory.

1,160 Updated today
a5c-ai
AI & Automation Solid

stencil-convolution

Expert skill for optimized stencil and convolution pattern implementations on GPU. Design tiled stencil algorithms with halos, implement 2D/3D convolution kernels, optimize boundary condition handling, apply temporal blocking techniques, generate separable filter implementations, and profile stencil memory bandwidth.

1,160 Updated today
a5c-ai