nccl-communication

Solid

NVIDIA Collective Communications Library integration for multi-GPU operations. Initialize NCCL communicators, execute collective operations, configure communication topologies, profile collective performance, and support RCCL for AMD compatibility.

AI & Automation | 1,160 stars | 71 forks | Updated today | MIT

Quality Score: 96/100

Stars (weight 20%): 100
Recency (weight 20%): 100
Frontmatter (weight 20%): 70
Documentation (weight 15%): 100
Issue Health (weight 10%): 50
License (weight 10%): 100
Description (weight 5%): 100

Skill Content

# nccl-communication

You are **nccl-communication** - a specialized skill for NVIDIA Collective Communications Library (NCCL) integration. This skill provides expert capabilities for multi-GPU collective operations.

## Overview

This skill supports the following multi-GPU communication tasks:

- Initialize NCCL communicators
- Execute all-reduce, all-gather, reduce-scatter operations
- Configure ring and tree communication topologies
- Handle multi-node NCCL communication
- Profile collective operation performance
- Optimize for NVLink vs PCIe topology
- Integrate with CUDA streams for async collectives
- Support RCCL for AMD GPU compatibility

## Prerequisites

- CUDA Toolkit 11.0+
- NCCL 2.10+
- Multiple GPUs (for meaningful use)
- MPI (for multi-node, optional)

## Capabilities

### 1. NCCL Initialization

Initialize communicators:

```c
#include <nccl.h>
#include <mpi.h>  // needed only for the per-rank (MPI) path

// Single-node multi-GPU initialization (one process drives all GPUs)
int numGPUs = 4;
ncclComm_t comms[4];
int devs[4] = {0, 1, 2, 3};
ncclCommInitAll(comms, numGPUs, devs);

// Per-rank initialization for MPI integration (one process per GPU).
// rank, worldSize, and localRank are assumed to be set earlier,
// for example from MPI_Comm_rank and MPI_Comm_size.
ncclUniqueId id;
ncclComm_t comm;
if (rank == 0) {
    ncclGetUniqueId(&id);
}
MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);
cudaSetDevice(localRank);
ncclCommInitRank(&comm, worldSize, id, rank);

// Cleanup: destroy every communicator that was created
ncclCommDestroy(comm);
for (int i = 0; i < numGPUs; ++i) {
    ncclCommDestroy(comms[i]);
}
```

### 2. All-Reduce Operations

Reduce across all GPUs:

```c
// Synchronous all-reduce
ncclAllReduce(sendbuff, recvbuff, count, ncclFloat, ncclSum, comm, stream);
cudaStreamSynchro...
```
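To make the all-reduce, reduce-scatter, all-gather, and ring-topology items above more concrete, here is a minimal CPU-only sketch in plain C of how a ring all-reduce can be composed from a reduce-scatter phase followed by an all-gather phase. It is a simulation for illustration, not NCCL's implementation and not an NCCL API: the names `ring_allreduce_sum` and `wrap` are hypothetical, each "rank" is just a host array, and the sketch assumes the element count is divisible by the number of ranks.

```c
#include <assert.h>
#include <stddef.h>

/* Wrap an index into the range [0, n). */
static int wrap(int i, int n) { return ((i % n) + n) % n; }

/*
 * CPU-only simulation of a ring all-reduce (sum).
 * Hypothetical helper for illustration; not part of NCCL.
 * bufs[r] is rank r's buffer of `count` floats. On return, every
 * buffer holds the element-wise sum of all input buffers.
 * Assumes count is divisible by nranks.
 */
void ring_allreduce_sum(float *bufs[], int nranks, size_t count)
{
    assert(nranks >= 1);
    assert(count % (size_t)nranks == 0);
    const size_t chunk = count / (size_t)nranks;

    /* Phase 1: reduce-scatter. In step s, rank r passes chunk (r - s)
     * to its right neighbour, which adds it to its own copy. After
     * nranks - 1 steps, rank r holds the fully reduced chunk (r + 1). */
    for (int s = 0; s < nranks - 1; ++s) {
        for (int r = 0; r < nranks; ++r) {
            const int dst = (r + 1) % nranks;
            const size_t off = (size_t)wrap(r - s, nranks) * chunk;
            for (size_t i = 0; i < chunk; ++i)
                bufs[dst][off + i] += bufs[r][off + i];
        }
    }

    /* Phase 2: all-gather. In step s, rank r passes the reduced chunk
     * (r + 1 - s) to its right neighbour, which overwrites its copy. */
    for (int s = 0; s < nranks - 1; ++s) {
        for (int r = 0; r < nranks; ++r) {
            const int dst = (r + 1) % nranks;
            const size_t off = (size_t)wrap(r + 1 - s, nranks) * chunk;
            for (size_t i = 0; i < chunk; ++i)
                bufs[dst][off + i] = bufs[r][off + i];
        }
    }
}
```

Calling `ring_allreduce_sum` with an array of per-rank buffers leaves the same summed result in every buffer, which matches what `ncclAllReduce` with `ncclSum` produces across ranks. Within each step, every rank reads one chunk and writes a different one, so the sequential loop gives the same result as ranks exchanging data simultaneously.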

Details

Author: a5c-ai
Repository: a5c-ai/babysitter
Created: 4 months ago
Last Updated: today
Language: JavaScript
License: MIT
