cutlass-triton

Solid

High-performance kernel template libraries and DSLs. Generate CUTLASS GEMM configurations, implement Triton kernel definitions, configure epilogue operations, tune tile sizes and warp arrangements, and benchmark against cuBLAS.

AI & Automation 1,160 stars 71 forks Updated today MIT

Quality Score: 96/100

Stars (20%): 100
Recency (20%): 100
Frontmatter (20%): 70
Documentation (15%): 100
Issue Health (10%): 50
License (10%): 100
Description (5%): 100

Skill Content

# cutlass-triton

You are **cutlass-triton** - a specialized skill for high-performance kernel template libraries and domain-specific languages. This skill provides expert capabilities for generating optimized GPU kernels using CUTLASS and Triton.

## Overview

This skill enables AI-powered kernel generation including:

- Generate CUTLASS GEMM configurations
- Implement Triton kernel definitions
- Configure epilogue operations
- Handle tensor layout transformations
- Tune tile sizes and warp arrangements
- Support mixed-precision matrix operations
- Benchmark against cuBLAS implementations
- Generate custom attention kernels

## Prerequisites

- CUTLASS 3.0+ (header-only library)
- Triton 2.0+ (Python package)
- CUDA Toolkit 11.0+
- Python 3.8+ (for Triton)

## Capabilities

### 1. CUTLASS GEMM Configuration

Configure high-performance GEMM:

```cpp
#include <cutlass/cutlass.h>
#include <cutlass/gemm/device/gemm.h>

// Define GEMM operation types
using ElementA = cutlass::half_t;
using ElementB = cutlass::half_t;
using ElementC = cutlass::half_t;
using ElementAccumulator = float;

using LayoutA = cutlass::layout::RowMajor;
using LayoutB = cutlass::layout::ColumnMajor;
using LayoutC = cutlass::layout::RowMajor;

// Define CUTLASS GEMM
using Gemm = cutlass::gemm::device::Gemm<
    ElementA, LayoutA,
    ElementB, LayoutB,
    ElementC, LayoutC,
    ElementAccumulator,
    cutlass::arch::OpClassTensorOp,
    cutlass::arch::Sm80,
    cutlass::gemm::GemmShape<128, 256, 64>,  // Thr...
```
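To make the roles of the tile shape, the float accumulator, and the epilogue more concrete, the following is a minimal CPU-only sketch of the same computation pattern, D = alpha * (A * B) + beta * C, processed one output tile at a time. It is not CUTLASS code and does not use any CUTLASS API: `TileShape`, `tiled_gemm`, and `naive_gemm` are hypothetical names introduced here for illustration, plain `float` stands in for `cutlass::half_t` inputs, and the linear-combination epilogue is an assumed example of an epilogue operation. The layouts follow the listing above (A row-major, B column-major, C row-major).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical tile descriptor; it only mimics the role a GemmShape<M, N, K>
// plays when describing how the output is partitioned.
template <int TileM, int TileN, int TileK>
struct TileShape {
    static constexpr int kM = TileM;
    static constexpr int kN = TileN;
    static constexpr int kK = TileK;
};

// Straightforward reference: D = alpha * (A * B) + beta * C.
// A is M x K row-major, B is K x N column-major, C and D are M x N row-major.
std::vector<float> naive_gemm(const std::vector<float>& A,
                              const std::vector<float>& B,
                              const std::vector<float>& C,
                              int M, int N, int K, float alpha, float beta) {
    std::vector<float> D(static_cast<std::size_t>(M) * N, 0.0f);
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) {
                acc += A[m * K + k] * B[n * K + k];  // B(k, n) lives at n * K + k
            }
            D[m * N + n] = alpha * acc + beta * C[m * N + n];
        }
    }
    return D;
}

// Same result, computed tile by tile: a K-loop "mainloop" fills a per-tile
// accumulator, then an "epilogue" applies alpha/beta and writes the tile out.
template <typename Tile>
std::vector<float> tiled_gemm(const std::vector<float>& A,
                              const std::vector<float>& B,
                              const std::vector<float>& C,
                              int M, int N, int K, float alpha, float beta) {
    std::vector<float> D(static_cast<std::size_t>(M) * N, 0.0f);
    std::vector<float> acc(static_cast<std::size_t>(Tile::kM) * Tile::kN);

    for (int m0 = 0; m0 < M; m0 += Tile::kM) {
        for (int n0 = 0; n0 < N; n0 += Tile::kN) {
            // Clamp tile bounds so sizes that are not tile multiples still work.
            const int m1 = std::min(m0 + Tile::kM, M);
            const int n1 = std::min(n0 + Tile::kN, N);
            std::fill(acc.begin(), acc.end(), 0.0f);

            // Mainloop: walk the K dimension in steps of Tile::kK.
            for (int k0 = 0; k0 < K; k0 += Tile::kK) {
                const int k1 = std::min(k0 + Tile::kK, K);
                for (int m = m0; m < m1; ++m) {
                    for (int n = n0; n < n1; ++n) {
                        float partial = 0.0f;
                        for (int k = k0; k < k1; ++k) {
                            partial += A[m * K + k] * B[n * K + k];
                        }
                        acc[(m - m0) * Tile::kN + (n - n0)] += partial;
                    }
                }
            }

            // Epilogue: linear combination with the source matrix C.
            for (int m = m0; m < m1; ++m) {
                for (int n = n0; n < n1; ++n) {
                    D[m * N + n] = alpha * acc[(m - m0) * Tile::kN + (n - n0)] +
                                   beta * C[m * N + n];
                }
            }
        }
    }
    return D;
}
```

A call such as `tiled_gemm<TileShape<4, 4, 4>>(A, B, C, M, N, K, alpha, beta)` should agree with `naive_gemm` for any problem size, including sizes that are not multiples of the tile. On a GPU, each output tile would typically be assigned to a thread block and subdivided further across warps, which this sketch does not attempt to model.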

Details

Author: a5c-ai
Repository: a5c-ai/babysitter
Created: 4 months ago
Last Updated: today
Language: JavaScript
License: MIT

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Solid

cublas-cudnn

Expert integration with NVIDIA GPU-accelerated math libraries. Configure cuBLAS tensor core operations, generate cuBLAS GEMM calls, integrate cuDNN layers, handle algorithm selection, and support mixed-precision operations.

1,160 Updated today
a5c-ai
AI & Automation Solid

cuda-toolkit

Deep integration with NVIDIA CUDA toolkit for kernel development, compilation, and debugging. Execute nvcc compilation with optimization flags analysis, generate and validate CUDA kernel code, analyze PTX/SASS assembly output, and configure execution parameters.

1,160 Updated today
a5c-ai
Data & Documents Listed

kernel-generator

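Triton Ascend operator code generation skill: generates high-performance Triton Ascend kernel code from task descriptions in KernelBench format. Supports both first-pass generation and iterative optimization based on error feedback.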

27 Updated 2 days ago
Just-it
AI & Automation Solid

triton-inference-config

Configure Triton inference config operations. Auto-activating skill for ML Deployment, part of the ML Deployment skill category. Use when configuring systems or services. Trigger with phrases like "triton inference config", "triton config", "triton".

2,274 Updated today
jeremylongshore
AI & Automation Solid

parallel-patterns

GPU parallel algorithm design patterns and implementations. Implement parallel reduction, scan/prefix sum, histogram, parallel sort algorithms, stream compaction, and work-efficient patterns optimized for specific GPU architectures.

1,160 Updated today
a5c-ai