NVIDIA C-RADIOv4: A Unified Vision Backbone for Scale

Published 2026-02-08 08:00

NVIDIA has announced the release of C-RADIOv4, a new “agglomerative” vision backbone that unifies three powerful architectures—SigLIP2, DINOv3, and SAM3—into a single student model. This update represents a significant step forward in building versatile AI models that can handle classification, dense prediction, and segmentation at scale without needing a specialized encoder for each task.

The core of C-RADIOv4’s success lies in its distillation process. By training a single Vision Transformer (ViT) student to match the dense feature maps and summary tokens of heterogeneous teacher models, NVIDIA has created a backbone that captures the best of three worlds:

* SigLIP2-g-384: Provides superior image-text alignment for retrieval and classification.
* DINOv3-7B: Offers high-quality self-supervised features for dense spatial tasks.
* SAM3: Enables robust segmentation capabilities and drop-in compatibility with the latest Segment Anything decoders.

### Breakthrough in Resolution Robustness

One of the most challenging aspects of vision models is maintaining performance across different input sizes. C-RADIOv4 introduces stochastic multi-resolution training, sampling input sizes from 128 px up to 1152 px. Coupled with the FeatSharp upsampling technique, this ensures that the model remains accurate whether it is processing a small thumbnail or a high-resolution medical image.

### Solving the “Artifact” Problem

Distilling from large models often results in the student copying the teacher’s “noise” or border artifacts. NVIDIA solved this with shift-equivariant losses: by showing the teacher and student different, independently shifted crops of the same image, the training forces the student to learn genuine semantic structure rather than memorizing position-fixed noise patterns.

### Deployment and Accessibility

C-RADIOv4 is designed for practical use, featuring a ViTDet mode for efficient inference. On an A100 GPU, the student model’s windowed attention allows it to outperform the original SAM3 ViT-L+ encoder in speed while maintaining competitive accuracy. The model has been released under the NVIDIA Open Model License, making it a powerful resource for researchers and enterprises looking to streamline their computer vision pipelines.

[Technical Paper](https://arxiv.org/abs/2601.17237) | [Model on Hugging Face](https://huggingface.co/nvidia/C-RADIOv4-H)
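To make these mechanisms concrete, here are a few illustrative sketches in PyTorch. First, the stochastic multi-resolution scheme: each training step samples a square side length between 128 px and 1152 px, snapped to an assumed 16 px ViT patch size, and resizes the batch before the forward pass. `student` and `batch` are placeholder names; this is a minimal sketch of the sampling idea, not NVIDIA's training code.

```python
import torch
import torch.nn.functional as F

# Sample a square side length between 128 and 1152 px, snapped to the
# (assumed) 16 px ViT patch size.
def sample_resolution(lo=128, hi=1152, patch=16):
    return torch.randint(lo // patch, hi // patch + 1, (1,)).item() * patch

def training_step(student, batch):
    side = sample_resolution()
    resized = F.interpolate(batch, size=(side, side),
                            mode="bilinear", align_corners=False)
    return student(resized)  # features at a randomly chosen resolution
```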
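The shift-equivariant loss can be sketched the same way. Below, the teacher and student see crops of the same image offset by a random whole-patch shift, and the loss is computed only on the overlapping region of the patch grid. The shared crop size, the feature shapes, and the assumption that the student's output has already been projected to the teacher's channel width (via an adaptor head) are all illustrative.

```python
import torch
import torch.nn.functional as F

def shift_equivariant_distill_loss(student, teacher, image, patch=16,
                                   crop=512, max_shift=4):
    """Sketch of a shift-equivariant distillation loss.

    `image` is (B, 3, H, W) with H, W >= crop + max_shift * patch.
    `student(x)` and `teacher(x)` are assumed to return patch-grid feature
    maps of shape (B, C, crop // patch, crop // patch); in practice the
    student output would pass through a per-teacher adaptor head first.
    """
    B, _, H, W = image.shape
    # Top-left corner of the student's view.
    y = torch.randint(0, H - crop - max_shift * patch + 1, (1,)).item()
    x = torch.randint(0, W - crop - max_shift * patch + 1, (1,)).item()
    # Independent whole-patch shift for the teacher's view.
    dy = torch.randint(0, max_shift + 1, (1,)).item() * patch
    dx = torch.randint(0, max_shift + 1, (1,)).item() * patch

    student_view = image[:, :, y:y + crop, x:x + crop]
    teacher_view = image[:, :, y + dy:y + dy + crop, x + dx:x + dx + crop]

    s_feat = student(student_view)
    with torch.no_grad():
        t_feat = teacher(teacher_view)

    # The teacher's view sits dy/dx pixels down-right of the student's, so
    # the overlap is the bottom-right of the student grid and the top-left
    # of the teacher grid. Matching only there means a position-fixed
    # artifact (e.g. a border token) lands on different grid cells in the
    # two views and cannot be copied verbatim.
    n, sy, sx = crop // patch, dy // patch, dx // patch
    return F.mse_loss(s_feat[:, :, sy:, sx:], t_feat[:, :, :n - sy, :n - sx])
```

Because the two views disagree on where any fixed-position pattern falls, the only way for the student to minimize this loss is to predict features that track the image content itself.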
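Finally, a quick-start sketch for loading the released checkpoint. The repository name comes from the Hugging Face link above; the call pattern, a `(summary, spatial_features)` pair returned from a plain forward pass on images scaled to [0, 1], mirrors earlier RADIO releases and should be treated as an assumption until the v4 model card confirms it.

```python
import torch
from transformers import AutoModel

# Checkpoint name taken from the Hugging Face link above. The forward
# signature -- a (summary, spatial_features) pair from images in [0, 1] --
# mirrors earlier RADIO releases and is an assumption for v4.
model = AutoModel.from_pretrained("nvidia/C-RADIOv4-H",
                                  trust_remote_code=True)
model.eval().cuda()

x = torch.rand(1, 3, 512, 512, device="cuda")  # dummy image in [0, 1]
with torch.no_grad():
    summary, spatial = model(x)

print(summary.shape)  # global embedding for retrieval/classification
print(spatial.shape)  # per-patch features for dense prediction heads
```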