GPU-Initiated Networking for NCCL

arXiv — cs.LG · Wednesday, November 26, 2025 at 5:00:00 AM
  • The introduction of GPU-Initiated Networking (GIN) in NCCL 2.28 marks a significant advance in GPU-to-GPU communication, particularly for Mixture-of-Experts (MoE) architectures. The new Device API lets GPUs issue low-latency communication directly to one another, bypassing the host-side coordination that CUDA communication has traditionally required (a minimal sketch of the pattern follows this summary).
  • This matters because MoE-style workloads depend on rapid data exchange between GPUs: removing the CPU from the critical path improves performance in applications that tightly interleave computation and communication.
  • The evolution towards device-initiated communication reflects a broader trend in AI and machine learning towards optimizing resource utilization and reducing latency. Innovations like AutoSAGE and CLO further illustrate the industry's focus on enhancing computational efficiency and scalability, particularly in the context of large language models and sparse graph neural networks.
— via World Pulse Now AI Editorial System
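
To make the idea concrete, here is a minimal CUDA sketch of the device-initiated pattern GIN enables: a kernel finishes a piece of compute and immediately pushes the result to a peer-visible buffer, signalling completion with a flag, without ever returning control to the CPU. The `gin_put` helper is a hypothetical stand-in, not an actual NCCL 2.28 Device API symbol, and the "remote" buffer is an ordinary device allocation so the sketch compiles and runs on a single GPU.

```cuda
// Hypothetical sketch of device-initiated communication. gin_put is an
// illustrative stand-in, NOT an actual NCCL 2.28 Device API symbol, and
// the "remote" buffer is an ordinary device allocation so the example
// compiles and runs on a single GPU.
#include <cuda_runtime.h>
#include <cstdio>

// Stand-in for a device-side put: copy the payload into a peer-visible
// buffer, then release a flag that a consumer can poll on.
__device__ void gin_put(float* remote, const float* local, int n,
                        volatile int* flag) {
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        remote[i] = local[i];
    __syncthreads();
    if (threadIdx.x == 0) {
        __threadfence_system();  // make payload visible before the flag flips
        *flag = 1;               // completion signal, raised from the device
    }
}

// Produce a partial result and send it without returning to the CPU
// between compute and communication -- the core idea behind GIN.
__global__ void expert_send(float* remote, float* local, int n,
                            volatile int* flag) {
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        local[i] = 2.0f * i;     // stand-in for per-expert computation
    __syncthreads();
    gin_put(remote, local, n, flag);
}

int main() {
    const int n = 256;
    float *local, *remote;
    int* flag;
    cudaMalloc(&local, n * sizeof(float));
    cudaMalloc(&remote, n * sizeof(float));
    cudaMallocHost(&flag, sizeof(int));  // pinned, device-visible flag
    *flag = 0;
    expert_send<<<1, 128>>>(remote, local, n, flag);
    cudaDeviceSynchronize();
    float probe;
    cudaMemcpy(&probe, remote + 1, sizeof(float), cudaMemcpyDeviceToHost);
    printf("flag=%d  remote[1]=%f\n", *flag, probe);  // expect flag=1, 2.0
    return 0;
}
```

In real GIN, the put and signal would be issued against NCCL device communicator handles over NVLink or RDMA-capable NICs; the point the sketch preserves is that communication is triggered from inside the kernel rather than by the host.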


Continue Reading
TPG-INR: Target Prior-Guided Implicit 3D CT Reconstruction for Enhanced Sparse-view Imaging
Positive · Artificial Intelligence
A novel framework named TPG-INR has been proposed for 3D CT reconstruction, which utilizes a 'target prior' derived from projection data to enhance implicit learning in ultra-sparse view scenarios. This approach integrates positional and structural encoding to improve voxel-wise reconstruction quality and efficiency, addressing limitations of existing methods that often overlook anatomical priors.
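
The "positional and structural encoding" mentioned above is a family of techniques. As a generic illustration (not TPG-INR's actual encoding, whose details are in the paper), the CUDA kernel below computes the standard sinusoidal positional features of 3D voxel coordinates that implicit reconstruction networks typically consume.

```cuda
// Generic NeRF-style sinusoidal positional encoding for 3D voxel
// coordinates -- an illustrative sketch, not TPG-INR's actual encoding.
#include <cuda_runtime.h>
#include <cstdio>

// Each thread encodes one coordinate (x, y, z) in [0, 1]^3 into
// 3 * 2 * FREQS features: sin/cos at FREQS octave frequencies per axis.
#define FREQS 4

__global__ void positional_encode(const float3* coords, float* feats, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float c[3] = {coords[i].x, coords[i].y, coords[i].z};
    float* out = feats + i * 3 * 2 * FREQS;
    for (int axis = 0; axis < 3; ++axis)
        for (int f = 0; f < FREQS; ++f) {
            float w = (1 << f) * 3.14159265f * c[axis];  // 2^f * pi * coord
            *out++ = sinf(w);
            *out++ = cosf(w);
        }
}

int main() {
    const int n = 8;
    float3* coords;
    float* feats;
    cudaMallocManaged(&coords, n * sizeof(float3));
    cudaMallocManaged(&feats, n * 3 * 2 * FREQS * sizeof(float));
    for (int i = 0; i < n; ++i)
        coords[i] = make_float3(i / (float)n, 0.5f, 0.25f);
    positional_encode<<<1, 32>>>(coords, feats, n);
    cudaDeviceSynchronize();
    printf("first feature of voxel 0: %f\n", feats[0]);
    return 0;
}
```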
LLMQ: Efficient Lower-Precision Pretraining for Consumer GPUs
Positive · Artificial Intelligence
LLMQ has been introduced as an efficient end-to-end CUDA/C++ implementation for training medium-sized language models on consumer-grade GPUs with limited memory and slower interconnects. The system makes training of models from 3B to 32B parameters practical on affordable hardware.
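
LLMQ's precise number formats aren't described above; the sketch below shows one common ingredient of lower-precision training, an absmax-scaled 8-bit quantize kernel with a per-block scale, purely as an illustration of how weights can be stored compactly on memory-limited GPUs.

```cuda
// Generic absmax-scaled 8-bit quantization with one scale per block -- a
// common ingredient of lower-precision training, shown for illustration;
// LLMQ's actual number formats and kernels may differ.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdint>
#include <cmath>

__global__ void quantize_block(const float* w, int8_t* q, float* scale, int n) {
    __shared__ float absmax;
    if (threadIdx.x == 0) absmax = 1e-8f;  // avoid divide-by-zero later
    __syncthreads();
    // Non-negative IEEE floats compare correctly as ints, so an integer
    // atomicMax on the bit pattern yields the float maximum of |w|.
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        atomicMax((int*)&absmax, __float_as_int(fabsf(w[i])));
    __syncthreads();
    if (threadIdx.x == 0) *scale = absmax / 127.0f;
    __syncthreads();
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        q[i] = (int8_t)rintf(w[i] / *scale);  // round to nearest int8 step
}

int main() {
    const int n = 1024;
    float *w, *scale;
    int8_t* q;
    cudaMallocManaged(&w, n * sizeof(float));
    cudaMallocManaged(&q, n);
    cudaMallocManaged(&scale, sizeof(float));
    for (int i = 0; i < n; ++i) w[i] = sinf(0.01f * i);
    quantize_block<<<1, 256>>>(w, q, scale, n);
    cudaDeviceSynchronize();
    // Dequantize on the host: w ~= q * scale, within one quantization step.
    printf("w[100]=%f  dequant=%f\n", w[100], q[100] * *scale);
    return 0;
}
```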
PyGraph: Robust Compiler Support for CUDA Graphs in PyTorch
Positive · Artificial Intelligence
PyGraph has been introduced as a robust compiler framework designed to enhance the deployment of CUDA Graphs in PyTorch, addressing the challenges of kernel launch latency in machine learning workloads. By implementing automatic code transformations and eliminating parameter copy overheads, PyGraph aims to significantly improve the efficiency of ML applications.
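
The launch-latency problem PyGraph targets can be seen with the stock CUDA Graphs runtime API: capture a kernel sequence once on a stream, then replay it with a single cudaGraphLaunch per iteration. The example below is plain CUDA, not PyGraph's compiler machinery.

```cuda
// Capturing a kernel sequence into a CUDA graph with the stock runtime
// API (CUDA 12 signatures), then replaying it with one launch call per
// iteration -- the per-kernel launch latency this amortizes is exactly
// the overhead PyGraph targets.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    const int blocks = (n + 255) / 256;
    float* x;
    cudaMalloc(&x, n * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);

    // Record three dependent launches into a graph instead of running them.
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    scale<<<blocks, 256, 0, s>>>(x, 2.0f, n);
    scale<<<blocks, 256, 0, s>>>(x, 0.5f, n);
    scale<<<blocks, 256, 0, s>>>(x, 3.0f, n);
    cudaStreamEndCapture(s, &graph);
    cudaGraphInstantiate(&exec, graph, 0);

    // Replay the captured sequence; each iteration costs one launch.
    for (int iter = 0; iter < 100; ++iter)
        cudaGraphLaunch(exec, s);
    cudaStreamSynchronize(s);
    printf("replayed 100 graph launches\n");
    return 0;
}
```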
