NVIDIA Launches Nemotron 3 Nano Omni Multimodal AI Model

NVIDIA unveils Nemotron 3 Nano Omni, an open multimodal model combining vision, audio, and language capabilities for more efficient AI agents.

NVIDIA announced the launch of Nemotron 3 Nano Omni, an open multimodal model that unifies vision, audio, and text understanding into a single system, according to blogs.nvidia.com. The model is designed to let AI agents respond faster while reasoning across multiple data types.

According to aws.amazon.com, Nemotron 3 Nano Omni has 30 billion total parameters, of which 3 billion are active (30B A3B), and is built on a Mamba2 Transformer Hybrid Mixture of Experts (MoE) architecture. The model combines three components: Nemotron 3 Nano LLM as the language backbone, CRADIO v4-H as the vision encoder, and Parakeet as the speech encoder. It supports a 131K-token context length, chain-of-thought reasoning, tool calling, JSON output, and word-level timestamps for transcription tasks.
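
To make the "30B A3B" notation concrete, the short Python sketch below works through the arithmetic: what share of the parameters is active for a given token, and roughly how much memory the weights alone would occupy. The 2-bytes-per-parameter figure (16-bit weights) is an assumption for illustration, not something the sources state, and the function names are hypothetical.

```python
# Figures quoted in the article: 30 billion total, 3 billion active parameters.
TOTAL_PARAMS = 30e9
ACTIVE_PARAMS = 3e9

# Assumption for illustration only: weights stored at 16-bit precision.
ASSUMED_BYTES_PER_PARAM = 2


def active_fraction(total: float, active: float) -> float:
    """Share of the model's parameters used for any single token."""
    return active / total


def weight_memory_gib(params: float, bytes_per_param: float) -> float:
    """Approximate memory needed to hold the weights, in GiB."""
    return params * bytes_per_param / 2**30


if __name__ == "__main__":
    share = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
    mem = weight_memory_gib(TOTAL_PARAMS, ASSUMED_BYTES_PER_PARAM)
    print(f"Active share per token: {share:.0%}")
    print(f"Weights at assumed 16-bit precision: about {mem:.0f} GiB")
```

In a mixture-of-experts model, each token is typically routed through only the active parameters, but every expert still has to be held in memory. Compute cost therefore tracks the 3 billion figure, while the memory footprint tracks the 30 billion figure.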

According to blogs.nvidia.com, Nemotron 3 Nano Omni “sets a new efficiency frontier for open multimodal models with leading accuracy and low cost,” topping six leaderboards for complex document intelligence and for video and audio understanding. Companies adopting the model include Aible, Applied Scientific Intelligence, Eka Care, Foxconn, H Company, Palantir, and Pyler, while Dell Technologies, Docusign, Infosys, K-Dense, Lila, Oracle, and Zefr are evaluating it.

The model is now available on Amazon SageMaker JumpStart, according to aws.amazon.com, and is licensed under the NVIDIA Open Model Agreement for commercial use.