parallel-patterns

Solid

GPU parallel algorithm design patterns and implementations. Implement parallel reduction, scan/prefix sum, histogram, parallel sort algorithms, stream compaction, and work-efficient patterns optimized for specific GPU architectures.

AI & Automation 1,160 stars 71 forks Updated today MIT

Quality Score: 96/100

Stars (20%): 100
Recency (20%): 100
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 50
License (10%): 100
Description (5%): 100

Skill Content

# parallel-patterns

You are **parallel-patterns** - a specialized skill for GPU parallel algorithm design patterns and implementations. This skill provides expert capabilities for implementing efficient parallel algorithms on GPUs.

## Overview

This skill enables AI-powered parallel algorithm development including:

- Implement parallel reduction algorithms (tree-based, warp)
- Generate scan (prefix sum) implementations
- Design histogram and binning algorithms
- Implement parallel sort algorithms (radix, merge)
- Generate stream compaction code
- Design work-efficient parallel patterns
- Handle multi-pass large-data algorithms
- Optimize for specific GPU architectures

## Prerequisites

- CUDA Toolkit 11.0+
- CUB library (included with CUDA)
- Thrust library (included with CUDA)

## Capabilities

### 1. Parallel Reduction

Implement efficient reductions:

```cuda
// Warp-level reduction (no shared memory needed for single warp)
__device__ float warpReduce(float val) {
    for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
        val += __shfl_down_sync(0xffffffff, val, offset);
    }
    return val;
}

// Block-level reduction with shared memory
template<int BLOCK_SIZE>
__device__ float blockReduce(float val) {
    __shared__ float shared[32];  // One slot per warp
    int lane = threadIdx.x % warpSize;
    int wid = threadIdx.x / warpSize;

    // Warp-level reduction
    val = warpReduce(val);

    // Write warp results to shared memory
    if (lane == 0) share...
```
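
The skill excerpt is cut off partway through its block-level reduction, so the following is a separate, minimal sketch of how a warp-level shuffle reduction of that shape might be driven from a complete program. It is not the skill's own `blockReduce`: instead of a shared-memory step, lane 0 of each warp adds its warp total to the output with `atomicAdd`, which is a simpler (and usually slower) way to combine warp results. The names `warpReduceSum` and `sumKernel`, the launch configuration, and the all-ones test input are assumptions made for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Warp-level sum via shuffle-down, same shape as warpReduce in the excerpt.
// After the loop, lane 0 holds the sum of all 32 lanes' inputs.
__device__ float warpReduceSum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
        val += __shfl_down_sync(0xffffffffu, val, offset);
    }
    return val;
}

// Hypothetical kernel: each thread accumulates a grid-stride partial sum,
// each warp reduces its 32 partials, and lane 0 adds the warp total to *out.
__global__ void sumKernel(const float* in, float* out, int n) {
    float partial = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        partial += in[i];
    }
    // No thread returns early, so every lane reaches the shuffle and the
    // full 0xffffffff mask is valid (block size must be a multiple of 32).
    partial = warpReduceSum(partial);
    if (threadIdx.x % warpSize == 0) {
        atomicAdd(out, partial);
    }
}

int main() {
    const int n = 1 << 20;
    std::vector<float> host(n, 1.0f);  // all ones, so the exact sum is n

    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dIn, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dOut, 0, sizeof(float));

    const int threads = 256;  // assumed launch shape; multiple of warp size
    const int blocks = 64;
    sumKernel<<<blocks, threads>>>(dIn, dOut, n);

    float result = 0.0f;
    cudaError_t err =
        cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    std::printf("sum = %.0f, expected = %d\n", result, n);
    return result == static_cast<float>(n) ? 0 : 1;
}
```

Built with something like `nvcc -o sum sum.cu` (file name assumed), the program exits with status 0 only when the device sum matches the element count. For production code, the CUB and Thrust libraries listed under Prerequisites provide tuned reductions that are generally preferable to a hand-written kernel.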

Details

Author: a5c-ai
Repository: a5c-ai/babysitter
Created: 4 months ago
Last Updated: today
Language: JavaScript
License: MIT

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Solid

warp-primitives

Warp-level programming and SIMD optimization. Use warp shuffle instructions, voting functions, cooperative groups, warp-synchronous algorithms, and minimize warp divergence for optimal GPU performance.

1,160 Updated today
a5c-ai
AI & Automation Solid

gpu-memory-analysis

Specialized skill for GPU memory hierarchy analysis and optimization. Analyze memory access patterns, detect bank conflicts, optimize cache utilization, profile global memory bandwidth, and generate optimized memory access code patterns.

1,160 Updated today
a5c-ai
AI & Automation Solid

cuda-graphs

Expert skill for CUDA Graph capture and optimization for reduced launch overhead. Capture CUDA operations into graphs, instantiate and execute graph instances, update graph node parameters, profile graph vs stream execution, design graph-friendly kernel patterns, and optimize launch latency for inference.

1,160 Updated today
a5c-ai
AI & Automation Solid

stencil-convolution

Expert skill for optimized stencil and convolution pattern implementations on GPU. Design tiled stencil algorithms with halos, implement 2D/3D convolution kernels, optimize boundary condition handling, apply temporal blocking techniques, generate separable filter implementations, and profile stencil memory bandwidth.

1,160 Updated today
a5c-ai
AI & Automation Solid

cuda-toolkit

Deep integration with NVIDIA CUDA toolkit for kernel development, compilation, and debugging. Execute nvcc compilation with optimization flags analysis, generate and validate CUDA kernel code, analyze PTX/SASS assembly output, and configure execution parameters.

1,160 Updated today
a5c-ai